[jira] [Updated] (ARROW-16614) [C++] Use lz4::lz4 for lz4's CMake target name
[ https://issues.apache.org/jira/browse/ARROW-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16614: --- Labels: pull-request-available (was: ) > [C++] Use lz4::lz4 for lz4's CMake target name > -- > > Key: ARROW-16614 > URL: https://issues.apache.org/jira/browse/ARROW-16614 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Because upstream uses {{lz4::lz4}} not {{LZ4::lz4}}. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16614) [C++] Use lz4::lz4 for lz4's CMake target name
Kouhei Sutou created ARROW-16614: Summary: [C++] Use lz4::lz4 for lz4's CMake target name Key: ARROW-16614 URL: https://issues.apache.org/jira/browse/ARROW-16614 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou Because upstream uses {{lz4::lz4}} not {{LZ4::lz4}}.
[jira] [Updated] (ARROW-11135) [Java][Gandiva] Using Maven Central artifacts as dependencies produces runtime errors
[ https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-11135: - Fix Version/s: 7.0.0 (was: 8.0.0) > [Java][Gandiva] Using Maven Central artifacts as dependencies produces > runtime errors > - > > Key: ARROW-11135 > URL: https://issues.apache.org/jira/browse/ARROW-11135 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 2.0.0, 3.0.0 >Reporter: Michael Mior >Assignee: Anthony Louis Gotlib Ferreira >Priority: Major > Fix For: 7.0.0 > > > I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the > integration is working well, but I'm having issues. As [suggested on the > mailing > list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E], > using Dremio's public artifacts solves the problem. Between two Apache > projects however, there would be a strong preference to use Apache artifacts > as a dependency. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-11135) [Java][Gandiva] Using Maven Central artifacts as dependencies produces runtime errors
[ https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-11135. -- Fix Version/s: 8.0.0 Resolution: Fixed Thanks for confirming this! I'm closing this. > [Java][Gandiva] Using Maven Central artifacts as dependencies produces > runtime errors > - > > Key: ARROW-11135 > URL: https://issues.apache.org/jira/browse/ARROW-11135 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 2.0.0, 3.0.0 >Reporter: Michael Mior >Assignee: Anthony Louis Gotlib Ferreira >Priority: Major > Fix For: 8.0.0 > > > I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the > integration is working well, but I'm having issues. As [suggested on the > mailing > list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E], > using Dremio's public artifacts solves the problem. Between two Apache > projects however, there would be a strong preference to use Apache artifacts > as a dependency.
[jira] [Commented] (ARROW-11135) [Java][Gandiva] Using Maven Central artifacts as dependencies produces runtime errors
[ https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539295#comment-17539295 ] Jonathan Swenson commented on ARROW-11135: -- I may be missing something, but it appears as though this solution does resolve the issue. Upgrading to arrow-gandiva 8.0.0 (haven't tried 7.0.0) from Maven Central appears to work on Intel Macs, but fails with a different linker error when running on an M1 Mac (Apple silicon). Filed https://issues.apache.org/jira/browse/ARROW-16608 to track this additional issue, but I believe that this particular issue is solved. > [Java][Gandiva] Using Maven Central artifacts as dependencies produces > runtime errors > - > > Key: ARROW-11135 > URL: https://issues.apache.org/jira/browse/ARROW-11135 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 2.0.0, 3.0.0 >Reporter: Michael Mior >Assignee: Anthony Louis Gotlib Ferreira >Priority: Major > > I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the > integration is working well, but I'm having issues. As [suggested on the > mailing > list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E], > using Dremio's public artifacts solves the problem. Between two Apache > projects however, there would be a strong preference to use Apache artifacts > as a dependency.
[jira] [Comment Edited] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++
[ https://issues.apache.org/jira/browse/ARROW-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539285#comment-17539285 ] Alenka Frim edited comment on ARROW-16609 at 5/19/22 4:47 AM: -- Also [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h] would be needed in {{dist/include}} for [{{arrow/python/python_to_arrow.cc:44}}|https://github.com/apache/arrow/blob/1cdedc4cbf0709ce440d69242afd47474a7148c7/cpp/src/arrow/python/python_to_arrow.cc#L44] and {{arrow/vendored/portable-snippets}} for {{arrow/util/int_util_internal.h:30}}. was (Author: alenkaf): Also [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h] would be needed in {{dist/include}} for [{{arrow/python/python_to_arrow.cc:44}}|https://github.com/apache/arrow/blob/1cdedc4cbf0709ce440d69242afd47474a7148c7/cpp/src/arrow/python/python_to_arrow.cc#L44]. > [C++] xxhash not installed into dist/lib/include when building C++ > -- > > Key: ARROW-16609 > URL: https://issues.apache.org/jira/browse/ARROW-16609 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Alenka Frim >Priority: Blocker > Fix For: 9.0.0 > > > My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} > but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module > was installed was in november 2021. > As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} -> > {{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}} this > module is needed to try to build Python C++ API separately from C++ > (https://issues.apache.org/jira/browse/ARROW-16340). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++
[ https://issues.apache.org/jira/browse/ARROW-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539285#comment-17539285 ] Alenka Frim commented on ARROW-16609: - Also [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h] would be needed in {{dist/include}} for [{{arrow/python/python_to_arrow.cc:44}}|https://github.com/apache/arrow/blob/1cdedc4cbf0709ce440d69242afd47474a7148c7/cpp/src/arrow/python/python_to_arrow.cc#L44]. > [C++] xxhash not installed into dist/lib/include when building C++ > -- > > Key: ARROW-16609 > URL: https://issues.apache.org/jira/browse/ARROW-16609 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Alenka Frim >Priority: Blocker > Fix For: 9.0.0 > > > My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} > but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module > was installed was in november 2021. > As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} -> > {{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}} this > module is needed to try to build Python C++ API separately from C++ > (https://issues.apache.org/jira/browse/ARROW-16340). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16612) [R] parquet files with compression extensions should use parquet writer for compression
[ https://issues.apache.org/jira/browse/ARROW-16612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim updated ARROW-16612: Summary: [R] parquet files with compression extensions should use parquet writer for compression (was: parquet files with compression extensions should use parquet writer for compression) > [R] parquet files with compression extensions should use parquet writer for > compression > --- > > Key: ARROW-16612 > URL: https://issues.apache.org/jira/browse/ARROW-16612 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 8.0.0 >Reporter: Sam Albers >Priority: Minor > > Right now arrow silently writes a file with a .gz extension through a > CompressedOutputStream rather than passing the compression option to the > parquet writer itself. The internal detect_compression() function detects the > extension, and that is what routes the write incorrectly. However, it only fails > at the read_parquet stage, which could lead to confusion. > {code:java} > library(arrow, warn.conflicts = FALSE) > tf <- tempfile(fileext = ".parquet.gz") > write_parquet(data.frame(x = 1:5), tf, compression = "gzip", > compression_level = 5) > read_parquet(tf) > #> Error: file must be a "RandomAccessFile"{code}
[jira] [Resolved] (ARROW-15222) [Ruby] Use Compute functions for Enumerable operations on Column
[ https://issues.apache.org/jira/browse/ARROW-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15222. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 12053 [https://github.com/apache/arrow/pull/12053] > [Ruby] Use Compute functions for Enumerable operations on Column > > > Key: ARROW-15222 > URL: https://issues.apache.org/jira/browse/ARROW-15222 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kanstantsin Ilchanka >Assignee: Kanstantsin Ilchanka >Priority: Minor > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Currently, operations like > {code:java} > table['column'].sum{code} > use the Enumerable module and are much slower than using the Arrow::Function sum
[jira] [Resolved] (ARROW-16604) [C++] Boost not included when build benchmarks
[ https://issues.apache.org/jira/browse/ARROW-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-16604. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13192 [https://github.com/apache/arrow/pull/13192] > [C++] Boost not included when build benchmarks > -- > > Key: ARROW-16604 > URL: https://issues.apache.org/jira/browse/ARROW-16604 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yibo Cai >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > {code:bash} > cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON .. > {code} > fails with many Boost-related errors, such as the one below: > {code:bash} > CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable): > Target "arrow-json-parser-benchmark" links to target "Boost::system" but > the target was not found. Perhaps a find_package() call is missing for an > IMPORTED target, or an ALIAS target is missing? > Call Stack (most recent call first): > src/arrow/CMakeLists.txt:114 (add_benchmark) > src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark) > {code} > The error goes away if tests are also built with {{-DARROW_BUILD_TESTS=ON}}. It looks like Boost > is not included when building only the benchmarks.
[jira] [Commented] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral artifacts
[ https://issues.apache.org/jira/browse/ARROW-16608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539166#comment-17539166 ] Kouhei Sutou commented on ARROW-16608: -- It seems that we need to build bundled binaries on M1 Mac like we did for wheels. Related files: * https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/github.osx.arm64.yml for wheel * https://github.com/apache/arrow/blob/master/dev/tasks/java-jars/github.yml for jars Should we create one {{libgandiva_jni.dylib}} that contains binaries for x86_64 and arm64? Or separate files such as {{libgandiva_jni_x86_64.dylib}} and {{libgandiva_jni_arm64.dylib}} or {{x86_64/libgandiva_jni.dylib}} and {{arm64/libgandiva_jni.dylib}}? [~anthonylouis] Do you want to work on this? > [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral > artifacts > -- > > Key: ARROW-16608 > URL: https://issues.apache.org/jira/browse/ARROW-16608 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva, Java >Affects Versions: 8.0.0 >Reporter: Jonathan Swenson >Priority: Major > > Potentially a blocker for Arrow integration into Calcite (CALCITE-2040); > however, it may be possible to move forward without M1 Mac support. > Potentially somewhat related to ARROW-11135. > Getting an instance of the JNILoader throws an Unsatisfied Link Error when it > tries to load the libgandiva_jni.dylib that it has extracted from the jar > into a temporary directory. 
> Simplified error: > {code:java} > Exception in thread "main" java.lang.UnsatisfiedLinkError: > /tmp_dir/libgandiva_jni.dylib_uuid: > dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried: > '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible > architecture (have 'x86_64', need 'arm64e')){code} > > Full error and stack trace: > {code:java} > Exception in thread "main" java.lang.UnsatisfiedLinkError: > /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe: > > dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe, > 0x0001): tried: > '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe' > (mach-o file, but is an incompatible architecture (have 'x86_64', need > 'arm64e')) > at java.lang.ClassLoader$NativeLibrary.load(Native Method) > at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950) > at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832) > at java.lang.Runtime.load0(Runtime.java:811) > at java.lang.System.load(System.java:1088) > at > org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74) > at > org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63) > at > org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53) > at > org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144) > at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67) > at io.acme.Main.main(Main.java:26) {code} > > This example loads three libraries from mavencentral using gradle: > {code:java} > repositories { > mavenCentral() > } > dependencies { > implementation("org.apache.arrow:arrow-memory-netty:8.0.0") > implementation("org.apache.arrow:arrow-vector:8.0.0") > implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0") > } {code} > Example code: > {code:java} > 
public class Main { > public static void main(String[] args) throws GandivaException { > Field field = new Field("int_field", FieldType.nullable(new > ArrowType.Int(32, true)), null); > Schema schema = makeSchema(field); > Condition condition = makeCondition(field); > Filter.make(schema, condition); > } > private static Schema makeSchema(Field field) { > List fieldList = new ArrayList<>(); > fieldList.add(field); > return new Schema(fieldList, null); > } > private static Condition makeCondition(Field f) { > List treeNodes = new ArrayList<>(2); > treeNodes.add(TreeBuilder.makeField(f)); > treeNodes.add(TreeBuilder.makeLiteral(4)); > TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, > new ArrowType.Bool()); > return TreeBuilder.makeCondition(comparison); > } > } {code} > While I haven't tested this exact example, a similar example executes without > issue on an intel x86 mac. -- This message was sent by Atlassian Jira (v8.20.7#820007)
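One of the layouts proposed in the comment above, a per-architecture subdirectory such as {{x86_64/libgandiva_jni.dylib}}, would imply a loader-side lookup roughly like the following. This is a hypothetical sketch written in Python rather than in the Java JniLoader; the function name `jni_resource_path` and the alias table are assumptions made for illustration and are not taken from the Arrow code base.

```python
# Assumed aliases: the same CPU family is reported under different names
# depending on the OS and runtime (illustrative, not exhaustive).
_ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def jni_resource_path(machine: str, library: str = "libgandiva_jni.dylib") -> str:
    """Return the hypothetical in-jar path of the library for one architecture."""
    arch = _ARCH_ALIASES.get(machine.lower())
    if arch is None:
        # Failing early with a clear message is friendlier than a dlopen error.
        raise ValueError(f"unsupported architecture: {machine}")
    return f"{arch}/{library}"


# In a real loader the machine name would come from the runtime,
# for example platform.machine() in Python or os.arch in Java.
print(jni_resource_path("arm64"))
print(jni_resource_path("amd64"))
```

A single universal binary would avoid this lookup entirely, at the cost of a larger artifact; the sketch only covers the separate-files option.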
[jira] [Resolved] (ARROW-16585) [C++] CMake and pkg-config files are broken when CMAKE_INSTALL_{BIN,INCLUDE,LIB}DIR is absolute
[ https://issues.apache.org/jira/browse/ARROW-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-16585. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13182 [https://github.com/apache/arrow/pull/13182] > [C++] CMake and pkg-config files are broken when > CMAKE_INSTALL_{BIN,INCLUDE,LIB}DIR is absolute > --- > > Key: ARROW-16585 > URL: https://issues.apache.org/jira/browse/ARROW-16585 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Alexander Shpilkin >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > As per title: {{{}cpp/src/gandiva/gandiva.pc.in{}}}, > {{{}cpp/src/parquet/parquet.pc.in{}}}, {{{}cpp/src/plasma/plasma.pc.in{}}}, > and {{cpp/src/skyhook/skyhook.pc.in}} have > {code:java} > prefix=@CMAKE_INSTALL_PREFIX@ > libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ > includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@ # not in plasma.pc.in{code} > while {{cpp/src/plasma/PlasmaConfig.cmake.in}} has > {code:java} > set(PLASMA_STORE_SERVER > "@CMAKE_INSTALL_PREFIX@/@CMAKE_INSTALL_BINDIR@/plasma-store-server@CMAKE_EXECUTABLE_SUFFIX@"){code} > and so they can’t handle absolute paths in > {{{}CMAKE_INSTALL_\{BIN,INCLUDE,LIB}DIR{}}}. This leads to broken .pc files > on NixOS in particular. > See “[Concatenating paths when building pkg-config > files|https://github.com/jtojnar/cmake-snips#concatenating-paths-when-building-pkg-config-files]” > for a thorough discussion of the problem and a suggested fix, or [KDE’s > extra-cmake-modules|https://invent.kde.org/frameworks/extra-cmake-modules/-/blob/master/modules/ECMGeneratePkgConfigFile.cmake#L166] > for a simpler approach. -- This message was sent by Atlassian Jira (v8.20.7#820007)
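The failure mode described in ARROW-16585 can be reproduced in a few lines. The following is an illustration in Python, not the CMake change merged in the pull request; `naive_libdir` and `joined_libdir` are hypothetical names, and the example paths are made up.

```python
import posixpath


def naive_libdir(prefix: str, libdir: str) -> str:
    """Mimic `${prefix}/@CMAKE_INSTALL_LIBDIR@`: always prepend the prefix."""
    return prefix + "/" + libdir


def joined_libdir(prefix: str, libdir: str) -> str:
    """Prepend the prefix only when libdir is a relative path."""
    # posixpath.join discards earlier components once it meets an absolute one.
    return posixpath.join(prefix, libdir)


# Relative libdir: both approaches agree.
print(naive_libdir("/usr", "lib"))
print(joined_libdir("/usr", "lib"))

# Absolute libdir: the naive concatenation yields a path that does not exist.
print(naive_libdir("/usr", "/opt/arrow/lib"))
print(joined_libdir("/usr", "/opt/arrow/lib"))
```

The same distinction applies to the include and bin directories; any template that hard-codes the prefix in front of an install directory variable is affected.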
[jira] [Resolved] (ARROW-15936) [Ruby] Add test for Arrow::DictionaryArray#raw_records
[ https://issues.apache.org/jira/browse/ARROW-15936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15936. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 12904 [https://github.com/apache/arrow/pull/12904] > [Ruby] Add test for Arrow::DictionaryArray#raw_records > -- > > Key: ARROW-15936 > URL: https://issues.apache.org/jira/browse/ARROW-15936 > Project: Apache Arrow > Issue Type: Sub-task > Components: Ruby >Reporter: Keisuke Okada >Assignee: Keisuke Okada >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 50m > Remaining Estimate: 0h
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539142#comment-17539142 ] Kouhei Sutou commented on ARROW-15678: -- How about using template to distinct implementation for each architecture? {noformat} diff --git a/cpp/src/arrow/compute/kernels/codegen_internal.h b/cpp/src/arrow/compute/kernels/codegen_internal.h index fa50427bc3..a4bd0eb586 100644 --- a/cpp/src/arrow/compute/kernels/codegen_internal.h +++ b/cpp/src/arrow/compute/kernels/codegen_internal.h @@ -710,8 +710,8 @@ struct ScalarUnaryNotNullStateful { Datum* out) { Status st = Status::OK(); ArrayData* out_arr = out->mutable_array(); - FirstTimeBitmapWriter out_writer(out_arr->buffers[1]->mutable_data(), - out_arr->offset, out_arr->length); + FirstTimeBitmapWriter<> out_writer(out_arr->buffers[1]->mutable_data(), + out_arr->offset, out_arr->length); VisitArrayValuesInline( arg0, [&](Arg0Value v) { diff --git a/cpp/src/arrow/compute/kernels/row_encoder.cc b/cpp/src/arrow/compute/kernels/row_encoder.cc index 10a1f4cda5..26316ec315 100644 --- a/cpp/src/arrow/compute/kernels/row_encoder.cc +++ b/cpp/src/arrow/compute/kernels/row_encoder.cc @@ -42,7 +42,7 @@ Status KeyEncoder::DecodeNulls(MemoryPool* pool, int32_t length, uint8_t** encod ARROW_ASSIGN_OR_RAISE(*null_bitmap, AllocateBitmap(length, pool)); uint8_t* validity = (*null_bitmap)->mutable_data(); -FirstTimeBitmapWriter writer(validity, 0, length); +FirstTimeBitmapWriter<> writer(validity, 0, length); for (int32_t i = 0; i < length; ++i) { if (encoded_bytes[i][0] == kValidByte) { writer.Set(); diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc index 7d8d2edc4b..433df0f1b7 100644 --- a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc +++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc @@ -353,8 +353,8 @@ struct IsInVisitor { const auto& state = checked_cast&>(*ctx->state()); ArrayData* output = 
out->mutable_array(); -FirstTimeBitmapWriter writer(output->buffers[1]->mutable_data(), output->offset, - output->length); +FirstTimeBitmapWriter<> writer(output->buffers[1]->mutable_data(), output->offset, + output->length); VisitArrayDataInline( this->data, diff --git a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc index 611601cab8..da7de1c277 100644 --- a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc +++ b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc @@ -1456,7 +1456,7 @@ struct MatchSubstringImpl { [](const void* raw_offsets, const uint8_t* data, int64_t length, int64_t output_offset, uint8_t* output) { const offset_type* offsets = reinterpret_cast(raw_offsets); - FirstTimeBitmapWriter bitmap_writer(output, output_offset, length); + FirstTimeBitmapWriter<> bitmap_writer(output, output_offset, length); for (int64_t i = 0; i < length; ++i) { const char* current_data = reinterpret_cast(data + offsets[i]); int64_t current_length = offsets[i + 1] - offsets[i]; diff --git a/cpp/src/arrow/util/bit_util_benchmark.cc b/cpp/src/arrow/util/bit_util_benchmark.cc index 258fd27785..66a81b4e04 100644 --- a/cpp/src/arrow/util/bit_util_benchmark.cc +++ b/cpp/src/arrow/util/bit_util_benchmark.cc @@ -386,7 +386,7 @@ static void BitmapWriter(benchmark::State& state) { } static void FirstTimeBitmapWriter(benchmark::State& state) { - BenchmarkBitmapWriter(state, state.range(0)); + BenchmarkBitmapWriter>(state, state.range(0)); } struct GenerateBitsFunctor { diff --git a/cpp/src/arrow/util/bit_util_test.cc b/cpp/src/arrow/util/bit_util_test.cc index 6c2aff4fbe..9b9f19feb1 100644 --- a/cpp/src/arrow/util/bit_util_test.cc +++ b/cpp/src/arrow/util/bit_util_test.cc @@ -832,14 +832,14 @@ TEST(FirstTimeBitmapWriter, NormalOperation) { const uint8_t fill_byte = static_cast(fill_byte_int); { uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte}; - auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 
12); + auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 12); WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1}); // {0b00110110, 0b1010, 0, 0} ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a}); } { uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte}; - auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 12); + auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 12); WriteVectorToWriter(writer, {0, 1, 1, 0, 1,
[jira] [Updated] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
[ https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Barron updated ARROW-16613: Description: Hello! I've noticed that when writing a `_metadata` file with `pyarrow.parquet.write_metadata`, it is very slow with a large `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that the concatenation inside `metadata.append_row_groups` is very slow. The writer first [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799]. Would it be possible to make a vectorized implementation of this? Where `append_row_groups` accepts a list of `FileMetaData` objects, and where concatenation happens only once? Repro (in IPython to use `%time`) {code:java} from io import BytesIO import pyarrow as pa import pyarrow.parquet as pq def create_example_file_meta_data(): data = { "str": pa.array(["a", "b", "c", "d"], type=pa.string()), "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()), "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()), "bool": pa.array([True, True, False, False], type=pa.bool_()), } table = pa.table(data) metadata_collector = [] pq.write_table(table, BytesIO(), metadata_collector=metadata_collector) return table.schema, metadata_collector[0] schema, meta = create_example_file_meta_data() metadata_collector = [meta] * 500 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms # Wall time: 234 ms metadata_collector = [meta] * 1000 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms # Wall time: 970 ms metadata_collector = [meta] * 2000 %time 
pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s # Wall time: 4.3 s metadata_collector = [meta] * 4000 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s # Wall time: 17.3 s {code} was: Hello! I've noticed that when writing a `_metadata` file with `pyarrow.parquet.write_metadata`, it is very slow with a large `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that the concatenation inside `metadata.append_row_groups` is very slow. The writer first and [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799]. Would it be possible to make a vectorized implementation of this? Where `append_row_groups` accepts a list of `FileMetaData` objects, and where concatenation happens only once? 
Repro (in IPython to use `%time`) {code:java} from io import BytesIO import pyarrow as pa import pyarrow.parquet as pq def create_example_file_meta_data(): data = { "str": pa.array(["a", "b", "c", "d"], type=pa.string()), "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()), "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()), "bool": pa.array([True, True, False, False], type=pa.bool_()), } table = pa.table(data) metadata_collector = [] pq.write_table(table, BytesIO(), metadata_collector=metadata_collector) return table.schema, metadata_collector[0] schema, meta = create_example_file_meta_data() metadata_collector = [meta] * 500 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms # Wall time: 234 ms metadata_collector = [meta] * 1000 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms # Wall time: 970 ms metadata_collector = [meta] * 2000 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s # Wall time: 4.3 s metadata_collector = [meta] * 4000 %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector) # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s # Wall time: 17.3 s {code} > [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector > appears to be O(n^2) > - > > Key: ARROW-16613 >
[jira] [Updated] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
[ https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Barron updated ARROW-16613:

Description:
Hello!

I've noticed that when writing a `_metadata` file with `pyarrow.parquet.write_metadata`, it is very slow with a large `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that the concatenation inside `metadata.append_row_groups` is very slow. The writer first [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

Would it be possible to make a vectorized implementation of this, where `append_row_groups` accepts a list of `FileMetaData` objects and concatenation happens only once?

Repro (in IPython to use `%time`)
{code:python}
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
{code}

was:
Hello!

I've noticed that when writing a `_metadata` file with `pyarrow.parquet.write_metadata`, it is very slow with a large `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that the concatenation inside `metadata.append_row_groups` is very slow. The writer first [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

Would it be possible to make a vectorized implementation of this, where `append_row_groups` accepts a list of `FileMetaData` objects and concatenation happens only once?

Repro (in IPython to use `%time`)
```
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
```

> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
> -
>
> Key: ARROW-16613
>
[jira] [Created] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Kyle Barron created ARROW-16613:
---

Summary: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
Key: ARROW-16613
URL: https://issues.apache.org/jira/browse/ARROW-16613
Project: Apache Arrow
Issue Type: Improvement
Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Kyle Barron

Hello!

I've noticed that when writing a `_metadata` file with `pyarrow.parquet.write_metadata`, it is very slow with a large `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that the concatenation inside `metadata.append_row_groups` is very slow. The writer first [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

Would it be possible to make a vectorized implementation of this, where `append_row_groups` accepts a list of `FileMetaData` objects and concatenation happens only once?

Repro (in IPython to use `%time`)
```
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
```

--
This message was sent by Atlassian Jira (v8.20.7#820007)
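The timings in the ARROW-16613 report roughly quadruple each time the collector doubles in size, which is what quadratic growth looks like. The pure-Python sketch below is only an illustration of that pattern, assuming (as the reporter suspects, but the source does not confirm) that each append re-copies everything accumulated so far. It does not use pyarrow; `append_one_by_one` and `append_all_at_once` are hypothetical names, and plain lists stand in for row-group metadata.

```python
def append_one_by_one(chunks):
    """Concatenate on every iteration: each step copies all items gathered so far."""
    merged = []
    copies = 0
    for chunk in chunks:
        merged = merged + chunk  # builds a brand-new list each time
        copies += len(merged)    # count how many items were copied
    return merged, copies


def append_all_at_once(chunks):
    """Concatenate once: every item is copied exactly one time."""
    merged = [item for chunk in chunks for item in chunk]
    return merged, len(merged)


for n in (500, 1000, 2000):
    chunks = [[i] for i in range(n)]
    _, slow = append_one_by_one(chunks)
    _, fast = append_all_at_once(chunks)
    print(n, slow, fast)
# 500 125250 500
# 1000 500500 1000
# 2000 2001000 2000
```

Doubling `n` multiplies the copy count of the first variant by about four, while the single-pass variant grows linearly. That is the kind of difference a list-accepting `append_row_groups` might deliver, if the assumption about per-append copying holds.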
[jira] [Resolved] (ARROW-15271) [R] Refactor do_exec_plan to return a RecordBatchReader
[ https://issues.apache.org/jira/browse/ARROW-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-15271. - Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13170 [https://github.com/apache/arrow/pull/13170] > [R] Refactor do_exec_plan to return a RecordBatchReader > --- > > Key: ARROW-15271 > URL: https://issues.apache.org/jira/browse/ARROW-15271 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 6.0.1 >Reporter: Will Jones >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Right now > [{{do_exec_plan}}|https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L18] > returns an Arrow table because {{head}}, {{tail}}, and {{arrange}} do. If > ARROW-14289 is completed and similar work is done for {{arrange}}, we may be > able to alter {{do_exec_plan}} to return a RBR instead. > The {{map_batches()}} implementation (ARROW-14029) could benefit from this > refactor. And it might make ARROW-15040 more useful. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16612) parquet files with compression extensions should use parquet writer for compression
Sam Albers created ARROW-16612:
--

Summary: parquet files with compression extensions should use parquet writer for compression
Key: ARROW-16612
URL: https://issues.apache.org/jira/browse/ARROW-16612
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 8.0.0
Reporter: Sam Albers

Right now, when the target file has a .gz extension, arrow silently writes it through a CompressedOutputStream rather than passing the compression option to the parquet writer itself. The internal detect_compression() function detects the extension, and that is what routes the write incorrectly. However, the failure only surfaces at the read_parquet stage, which could lead to confusion.

{code:r}
library(arrow, warn.conflicts = FALSE)
tf <- tempfile(fileext = ".parquet.gz")
write_parquet(data.frame(x = 1:5), tf, compression = "gzip", compression_level = 5)
read_parquet(tf)
#> Error: file must be a "RandomAccessFile"
{code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
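A minimal Python sketch of why wrapping a whole Parquet file in a gzip stream, as described in ARROW-16612, yields something a Parquet reader cannot open directly. It is not arrow's implementation. It relies only on the fact that a Parquet file begins and ends with the 4-byte magic `PAR1`, with the footer located by seeking from the end of the file. `fake_parquet_bytes` and `has_parquet_magic` are hypothetical helpers, and the payload is a placeholder rather than a real file.

```python
import gzip

PARQUET_MAGIC = b"PAR1"


def fake_parquet_bytes() -> bytes:
    """Stand-in for a Parquet file: magic, placeholder body, magic."""
    return PARQUET_MAGIC + b"placeholder row groups and footer" + PARQUET_MAGIC


def has_parquet_magic(raw: bytes) -> bool:
    """Check the leading and trailing magic a reader expects to find."""
    return raw[:4] == PARQUET_MAGIC and raw[-4:] == PARQUET_MAGIC


plain = fake_parquet_bytes()
wrapped = gzip.compress(plain)  # whole-file compression, like a compressed output stream

print(has_parquet_magic(plain))            # True
print(has_parquet_magic(wrapped))          # False
print(gzip.decompress(wrapped) == plain)   # True
```

The data is intact, but the footer is only reachable after decompressing the entire stream, which is consistent with the reader demanding random access. Compression applied by the Parquet writer itself works inside the file and leaves the outer layout seekable.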
[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107 ]

Jonathan Keane edited comment on ARROW-15678 at 5/18/22 10:03 PM:
--

[~kou] Do you think you might be able to take a look at this? The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on, and following that there are a few possible fixes (though none of them were fully implemented or decided on).

was (Author: jonkeane):
@kou Do you think you might be able to take a look at this? The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on, and following that there are a few possible fixes (though none of them were fully implemented or decided on).

> [C++][CI] a crossbow job with MinRelSize enabled
>
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Continuous Integration
> Reporter: Jonathan Keane
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 9.0.0
>
> Time Spent: 13h
> Remaining Estimate: 0h
>

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107 ]

Jonathan Keane commented on ARROW-15678:

@kou Do you think you might be able to take a look at this? The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on, and following that there are a few possible fixes (though none of them were fully implemented or decided on).

> [C++][CI] a crossbow job with MinRelSize enabled
>
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Continuous Integration
> Reporter: Jonathan Keane
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 9.0.0
>
> Time Spent: 13h
> Remaining Estimate: 0h
>

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16144) [R] Write compressed data streams (particularly over S3)
[ https://issues.apache.org/jira/browse/ARROW-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-16144. - Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13183 [https://github.com/apache/arrow/pull/13183] > [R] Write compressed data streams (particularly over S3) > > > Key: ARROW-16144 > URL: https://issues.apache.org/jira/browse/ARROW-16144 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Carl Boettiger >Assignee: Sam Albers >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The python bindings have `CompressedOutputStream`, but I don't see how we > can do this on the R side (e.g. with `write_csv_arrow()`). It would be > wonderful if we could both read and write compressed streams, particularly > for CSV and particularly for remote filesystems, where this can provide > considerable performance improvements. > (For comparison, readr will write a compressed stream automatically based on > the extension for the given filename, e.g. `readr::write_csv(data, > "file.csv.gz")` or `write_csv("data.file.xz")` ) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16596) [C++] Add a strptime option to control the cutoff between 1900 and 2000 when %y
[ https://issues.apache.org/jira/browse/ARROW-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539077#comment-17539077 ]

Dragoș Moldovan-Grünfeld commented on ARROW-16596:
--

Maybe related. I find the following unexpected:
{code:r}
library(arrow, warn.conflicts = FALSE)
a <- Array$create("68-10-07 19:04:0")
call_function("strptime", a, options = list(format = "%Y-%m-%d %H:%M:%S", unit = 0L))
#> Array
#>
#> [
#>   0068-10-07 19:04:00
#> ]
call_function("strptime", a, options = list(format = "%y-%m-%d %H:%M:%S", unit = 0L))
#> Array
#>
#> [
#>   2068-10-07 19:04:00
#> ]
{code}
I would expect an error when there is a mismatch between the string and the format, i.e. the string has a short year ({{%y}}) and we try to parse it using a long format ({{%Y}}). I think it would be much better to raise an error or return a null in this situation.

> [C++] Add a strptime option to control the cutoff between 1900 and 2000 when %y
>
>
> Key: ARROW-16596
> URL: https://issues.apache.org/jira/browse/ARROW-16596
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, R
> Affects Versions: 8.0.0
> Reporter: Dragoș Moldovan-Grünfeld
> Priority: Major
>
> When parsing a string with the year in the short format ({{%y}}) to datetime,
> it would be great if we could have control over the cutoff point between 1900
> and 2000. Currently it is implicitly set to 68:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> a <- Array$create(c("68-05-17", "69-05-17"))
> call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
> #> Array
> #>
> #> [
> #>   2068-05-17 00:00:00,
> #>   1969-05-17 00:00:00
> #> ]
> {code}
> For example, lubridate calls this argument {{cutoff_2000}} (e.g. for
> {{fast_strptime()}}). It works as follows:
> {code:r}
> library(lubridate, warn.conflicts = FALSE)
> dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
> fast_strptime(dates_vector, format = "%y-%m-%d")
> #> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
> fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
> #> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
> fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
> #> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
> {code}
> In the {{lubridate::fast_strptime()}} documentation it is described as
> follows:
> {quote}cutoff_2000
> integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are
> parsed as though starting with 20, otherwise parsed as though starting with
> 19. {-}Available only for functions relying on lubridates internal parser{-}.
> {quote}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
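The cutoff rule quoted from the lubridate documentation can be sketched in a few lines of Python. This is an illustration of the pivot logic only, not Arrow's or lubridate's code. `expand_two_digit_year` is a hypothetical helper, and its parameter name `cutoff_2000` is borrowed from lubridate. The default of 68 mirrors the implicit cutoff reported in ARROW-16596.

```python
from datetime import datetime


def expand_two_digit_year(yy: int, cutoff_2000: int = 68) -> int:
    """Map a two-digit year to a full year using a pivot.

    Values less than or equal to cutoff_2000 become 20yy; larger values become 19yy.
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy}")
    century = 2000 if yy <= cutoff_2000 else 1900
    return century + yy


years = [68, 69, 55]
print([expand_two_digit_year(y) for y in years])                  # [2068, 1969, 2055]
print([expand_two_digit_year(y, cutoff_2000=50) for y in years])  # [1968, 1969, 1955]
print([expand_two_digit_year(y, cutoff_2000=70) for y in years])  # [2068, 2069, 2055]

# Python's own strptime applies the same fixed pivot of 68 for %y.
print(datetime.strptime("68-05-17", "%y-%m-%d").year)  # 2068
print(datetime.strptime("69-05-17", "%y-%m-%d").year)  # 1969
```

The three printed lists match the three `fast_strptime()` calls in the issue, so an option of this shape would be enough to reproduce lubridate's behaviour.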
[jira] [Updated] (ARROW-16596) [C++] Add a strptime option to control the cutoff between 1900 and 2000 when %y
[ https://issues.apache.org/jira/browse/ARROW-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-16596: - Description: When parsing to datetime a string with year in the short format ({{{}%y{}}}), it would be great if we could have control over the cutoff point between 1900 and 2000. Currently it is implicitly set to 68: {code:r} library(arrow, warn.conflicts = FALSE) a <- Array$create(c("68-05-17", "69-05-17")) call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L)) #> Array #> #> [ #> 2068-05-17 00:00:00, #> 1969-05-17 00:00:00 #> ] {code} For example, lubridate named this argument {{cutoff_2000}} argument (e.g. for {{{}fast_strptime){}}}. This works as follows: {code:r} library(lubridate, warn.conflicts = FALSE) dates_vector <- c("68-05-17", "69-05-17", "55-05-17") fast_strptime(dates_vector, format = "%y-%m-%d") #> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC" fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50) #> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC" fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70) #> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC" {code} In the {{lubridate::fast_strptime()}} documentation it is described as follows: {quote}cutoff_2000 integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are parsed as though starting with 20, otherwise parsed as though starting with 19. {-}Available only for functions relying on lubridates internal parser{-}. {quote} was: When parsing to datetime a string with year in the short format ({{{}%y{}}}), it would be great if we could have control over the cutoff point between 1900 and 2000. 
Currently it is implicitly set to 68: {code:r} library(arrow, warn.conflicts = FALSE) a <- Array$create(c("68-05-17", "69-05-17")) call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L)) #> Array #> #> [ #> 2068-05-17 00:00:00, #> 1969-05-17 00:00:00 #> ] {code} For example, lubridate named this argument {{cutoff_2000}} argument (e.g. for {{{}fast_strptime){}}}. This works as follows: {code:r} library(lubridate, warn.conflicts = FALSE) dates_vector <- c("68-05-17", "69-05-17", "55-05-17") fast_strptime(dates_vector, format = "%y-%m-%d") #> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC" fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50) #> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC" fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70) #> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC" {code} In the {{lubridate::fast_strptime()}} documentation it is described as follows: {quote}cutoff_2000 integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are parsed as though starting with 20, otherwise parsed as though starting with 19. Available only for functions relying on lubridates internal parser. {quote} > [C++] Add a strptime option to control the cutoff between 1900 and 2000 when > %y > > > Key: ARROW-16596 > URL: https://issues.apache.org/jira/browse/ARROW-16596 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > When parsing to datetime a string with year in the short format ({{{}%y{}}}), > it would be great if we could have control over the cutoff point between 1900 > and 2000. 
Currently it is implicitly set to 68: > {code:r} > library(arrow, warn.conflicts = FALSE) > a <- Array$create(c("68-05-17", "69-05-17")) > call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L)) > #> Array > #> > #> [ > #> 2068-05-17 00:00:00, > #> 1969-05-17 00:00:00 > #> ] > {code} > For example, lubridate named this argument {{cutoff_2000}} argument (e.g. for > {{{}fast_strptime){}}}. This works as follows: > {code:r} > library(lubridate, warn.conflicts = FALSE) > dates_vector <- c("68-05-17", "69-05-17", "55-05-17") > fast_strptime(dates_vector, format = "%y-%m-%d") > #> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC" > fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50) > #> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC" > fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70) > #> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC" > {code} > In the {{lubridate::fast_strptime()}} documentation it is described as > follows: > {quote}cutoff_2000 > integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are > parsed as though starting with 20, otherwise parsed as though starting with > 19. {-}Available only for functions relying on lubridates internal parser{-}. > {quote} -- This
[jira] [Updated] (ARROW-16604) [C++] Boost not included when build benchmarks
[ https://issues.apache.org/jira/browse/ARROW-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-16604:
---
Labels: pull-request-available (was: )

> [C++] Boost not included when build benchmarks
> --
>
> Key: ARROW-16604
> URL: https://issues.apache.org/jira/browse/ARROW-16604
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Yibo Cai
> Assignee: Kouhei Sutou
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {code:bash}
> cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
> {code}
> fails with many Boost-related errors, such as the one below:
> {code:bash}
> CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
>   Target "arrow-json-parser-benchmark" links to target "Boost::system" but
>   the target was not found. Perhaps a find_package() call is missing for an
>   IMPORTED target, or an ALIAS target is missing?
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:114 (add_benchmark)
>   src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
> {code}
> The error goes away if tests are also built ({{-DARROW_BUILD_TESTS=ON}}). It
> looks like Boost is not included when only benchmarks are built.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral artifacts
[ https://issues.apache.org/jira/browse/ARROW-16608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Swenson updated ARROW-16608:
-
Summary: [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral artifacts (was: [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch)

> [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral artifacts
> --
>
> Key: ARROW-16608
> URL: https://issues.apache.org/jira/browse/ARROW-16608
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++ - Gandiva, Java
> Affects Versions: 8.0.0
> Reporter: Jonathan Swenson
> Priority: Major
>
> Potentially a blocker for Arrow integration into Calcite (CALCITE-2040),
> although it may be possible to move forward without M1 Mac support.
> Potentially related to ARROW-11135.
> Getting an instance of JniLoader throws an UnsatisfiedLinkError when it
> tries to load the libgandiva_jni.dylib that it has extracted from the jar
> into a temporary directory.
> Simplified error:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
> /tmp_dir/libgandiva_jni.dylib_uuid:
> dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried:
> '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible
> architecture (have 'x86_64', need 'arm64e')){code}
>
> Full error and stack trace:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
> /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe:
> dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe, 0x0001): tried:
> '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe'
> (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e'))
>     at java.lang.ClassLoader$NativeLibrary.load(Native Method)
>     at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
>     at java.lang.Runtime.load0(Runtime.java:811)
>     at java.lang.System.load(System.java:1088)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
>     at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
>     at io.acme.Main.main(Main.java:26) {code}
>
> This example pulls three libraries from Maven Central using Gradle:
> {code:java}
> repositories {
>     mavenCentral()
> }
> dependencies {
>     implementation("org.apache.arrow:arrow-memory-netty:8.0.0")
>     implementation("org.apache.arrow:arrow-vector:8.0.0")
>     implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0")
> } {code}
> Example code:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.arrow.gandiva.evaluator.Filter;
> import org.apache.arrow.gandiva.exceptions.GandivaException;
> import org.apache.arrow.gandiva.expression.Condition;
> import org.apache.arrow.gandiva.expression.TreeBuilder;
> import org.apache.arrow.gandiva.expression.TreeNode;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.FieldType;
> import org.apache.arrow.vector.types.pojo.Schema;
>
> public class Main {
>   public static void main(String[] args) throws GandivaException {
>     Field field = new Field("int_field", FieldType.nullable(new ArrowType.Int(32, true)), null);
>     Schema schema = makeSchema(field);
>     Condition condition = makeCondition(field);
>     // Filter.make triggers JniLoader, which is where the error is thrown.
>     Filter.make(schema, condition);
>   }
>
>   private static Schema makeSchema(Field field) {
>     List<Field> fieldList = new ArrayList<>();
>     fieldList.add(field);
>     return new Schema(fieldList, null);
>   }
>
>   // Builds the condition int_field < 4.
>   private static Condition makeCondition(Field f) {
>     List<TreeNode> treeNodes = new ArrayList<>(2);
>     treeNodes.add(TreeBuilder.makeField(f));
>     treeNodes.add(TreeBuilder.makeLiteral(4));
>     TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, new ArrowType.Bool());
>     return TreeBuilder.makeCondition(comparison);
>   }
> } {code}
> While I haven't tested this exact example, a similar example executes without
> issue on an Intel x86 Mac.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch
[ https://issues.apache.org/jira/browse/ARROW-16608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539073#comment-17539073 ]

Jonathan Swenson commented on ARROW-16608:
--

Building the Gandiva library and jar from source on the M1 Mac (on master) and then loading it manually works, but the dependencies hosted on Maven Central do not seem to be deployed in a way that permits use from a project running on Apple silicon.

> [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch
> --
>
> Key: ARROW-16608
> URL: https://issues.apache.org/jira/browse/ARROW-16608
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++ - Gandiva, Java
> Affects Versions: 8.0.0
> Reporter: Jonathan Swenson
> Priority: Major
>
> Potentially a blocker for Arrow integration into Calcite (CALCITE-2040),
> although it may be possible to move forward without M1 Mac support.
> Potentially related to ARROW-11135.
> Getting an instance of JniLoader throws an UnsatisfiedLinkError when it
> tries to load the libgandiva_jni.dylib that it has extracted from the jar
> into a temporary directory.
> Simplified error:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
> /tmp_dir/libgandiva_jni.dylib_uuid:
> dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried:
> '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible
> architecture (have 'x86_64', need 'arm64e')){code}
>
> Full error and stack trace:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
> /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe:
> dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe, 0x0001): tried:
> '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe'
> (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e'))
>     at java.lang.ClassLoader$NativeLibrary.load(Native Method)
>     at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
>     at java.lang.Runtime.load0(Runtime.java:811)
>     at java.lang.System.load(System.java:1088)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
>     at org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
>     at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
>     at io.acme.Main.main(Main.java:26) {code}
>
> This example pulls three libraries from Maven Central using Gradle:
> {code:java}
> repositories {
>     mavenCentral()
> }
> dependencies {
>     implementation("org.apache.arrow:arrow-memory-netty:8.0.0")
>     implementation("org.apache.arrow:arrow-vector:8.0.0")
>     implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0")
> } {code}
> Example code:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.arrow.gandiva.evaluator.Filter;
> import org.apache.arrow.gandiva.exceptions.GandivaException;
> import org.apache.arrow.gandiva.expression.Condition;
> import org.apache.arrow.gandiva.expression.TreeBuilder;
> import org.apache.arrow.gandiva.expression.TreeNode;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.FieldType;
> import org.apache.arrow.vector.types.pojo.Schema;
>
> public class Main {
>   public static void main(String[] args) throws GandivaException {
>     Field field = new Field("int_field", FieldType.nullable(new ArrowType.Int(32, true)), null);
>     Schema schema = makeSchema(field);
>     Condition condition = makeCondition(field);
>     // Filter.make triggers JniLoader, which is where the error is thrown.
>     Filter.make(schema, condition);
>   }
>
>   private static Schema makeSchema(Field field) {
>     List<Field> fieldList = new ArrayList<>();
>     fieldList.add(field);
>     return new Schema(fieldList, null);
>   }
>
>   // Builds the condition int_field < 4.
>   private static Condition makeCondition(Field f) {
>     List<TreeNode> treeNodes = new ArrayList<>(2);
>     treeNodes.add(TreeBuilder.makeField(f));
>     treeNodes.add(TreeBuilder.makeLiteral(4));
>     TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, new ArrowType.Bool());
>     return TreeBuilder.makeCondition(comparison);
>   }
> } {code}
> While I haven't tested this exact example, a similar example executes without
> issue on an Intel x86 Mac.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16604) [C++] Boost not included when build benchmarks
[ https://issues.apache.org/jira/browse/ARROW-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou reassigned ARROW-16604:

Assignee: Kouhei Sutou

> [C++] Boost not included when build benchmarks
> --
>
> Key: ARROW-16604
> URL: https://issues.apache.org/jira/browse/ARROW-16604
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Yibo Cai
> Assignee: Kouhei Sutou
> Priority: Major
>
> {code:bash}
> cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
> {code}
> fails with many Boost-related errors, such as the one below:
> {code:bash}
> CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
>   Target "arrow-json-parser-benchmark" links to target "Boost::system" but
>   the target was not found. Perhaps a find_package() call is missing for an
>   IMPORTED target, or an ALIAS target is missing?
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:114 (add_benchmark)
>   src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
> {code}
> The error goes away if tests are also built ({{-DARROW_BUILD_TESTS=ON}}). It
> looks like Boost is not included when only benchmarks are built.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16281) [R] [CI] Bump versions with the release of 4.2
[ https://issues.apache.org/jira/browse/ARROW-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-16281. Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 12980 [https://github.com/apache/arrow/pull/12980] > [R] [CI] Bump versions with the release of 4.2 > -- > > Key: ARROW-16281 > URL: https://issues.apache.org/jira/browse/ARROW-16281 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 11h 40m > Remaining Estimate: 0h > > Now that R 4.2 is released, we should bump all of our R versions where we > have ones hardcoded. > This will mean dropping support for 3.4 entirely and adding in 4.0 to > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/r/github.linux.versions.yml#L34 > There are a few other places that we have hard-coded versions (we might need > to wait a few days for these to catch up): > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/tasks.yml#L1291-L1295 > https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/.github/workflows/r.yml#L60 > (and a few other places in that file — though one note: we build an old > version of windows that uses rtools35 in the GHA CI so that we catch when we > break that — we'll want to keep that!) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16611) [Python] MapArray pandas round trip is broken
[ https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539066#comment-17539066 ]

Antoine Pitrou commented on ARROW-16611:

Also, did it work with previous versions of PyArrow? I don't think so.

> [Python] MapArray pandas round trip is broken
> -
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Reporter: Robbie Gruener
> Priority: Major
>
> A pyarrow.MapArray that has been converted to pandas cannot be successfully
> converted back.
> The following snippet does not work:
>
> ```
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> ```
> `Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object
> (java.lang.RuntimeException)`

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16611) [Python] MapArray pandas round trip is broken
[ https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-16611:
---
Priority: Major (was: Minor)

> [Python] MapArray pandas round trip is broken
> -
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Reporter: Robbie Gruener
> Priority: Major
>
> A pyarrow.MapArray that has been converted to pandas cannot be successfully
> converted back.
> The following snippet does not work:
>
> ```
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> ```
> `Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object
> (java.lang.RuntimeException)`

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16611) [Python] MapArray pandas round trip is broken
[ https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16611: --- Summary: [Python] MapArray pandas round trip is broken (was: MapArray pandas round trip is broken)
[jira] [Updated] (ARROW-16611) MapArray pandas round trip is broken
[ https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16611: --- Component/s: Python
[jira] [Commented] (ARROW-16611) [Python] MapArray pandas round trip is broken
[ https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539065#comment-17539065 ] Antoine Pitrou commented on ARROW-16611: Hmm, where does the "java.lang.RuntimeException" come from?
[jira] [Created] (ARROW-16611) MapArray pandas round trip is broken
Robbie Gruener created ARROW-16611: -- Summary: MapArray pandas round trip is broken Key: ARROW-16611 URL: https://issues.apache.org/jira/browse/ARROW-16611 Project: Apache Arrow Issue Type: Bug Affects Versions: 8.0.0 Reporter: Robbie Gruener A pyarrow.MapArray that has been converted to pandas cannot be converted back. The following snippet does not work: ``` import pyarrow as pa data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]] ty = pa.map_(pa.string(), pa.int64()) map_col = pa.array(data, type=ty) pa.MapArray.from_pandas(map_col.to_pandas()) ``` `Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object (java.lang.RuntimeException)`
[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions
[ https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539052#comment-17539052 ] Yaron Gvili commented on ARROW-16211: - This second-layer-registry approach is good for another use case in which the user runs multiple execution engine invocations, either in sequence or in parallel, from the same Python interpreter and wants to keep separate the UDFs registered in each invocation. > [C++][Python] Unregister compute functions > -- > > Key: ARROW-16211 > URL: https://issues.apache.org/jira/browse/ARROW-16211 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > In general, when using UDFs, the user defines a function expecting a > particular outcome. When building the program, there needs to be a way to > update existing function kernels if it expands beyond what is planned before. > In such situations, there should be a way to remove the existing definition > and add a new definition. To enable this, the unregister functionality has to > be included. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16592) [FlightRPC][Python] Regression in DoPut error handling
[ https://issues.apache.org/jira/browse/ARROW-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16592: --- Labels: pull-request-available (was: ) > [FlightRPC][Python] Regression in DoPut error handling > -- > > Key: ARROW-16592 > URL: https://issues.apache.org/jira/browse/ARROW-16592 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Python >Reporter: Lubo Slivka >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In PyArrow 8.0.0, any error raised while handling DoPut on the server results > in FlightInternalError on the client. > In PyArrow 7.0.0, errors raised while handling DoPut are propagated/converted > to non-internal errors. > — > Example: on 7.0.0, raising FlightCancelledError while handling DoPut on the > server would propagate that error including extra_info all the way to the > FlightClient. This is not the case anymore on 8.0.0. > The FlightInternalError contains extra detail that is derived from the > cancelled error though: > {code:java} > /arrow/cpp/src/arrow/flight/client.cc:363: Close() failed: IOError: message from FlightError is here>. Detail: Cancelled. gRPC client debug > context: {"created":"@1652777650.446052211","description":"Error received > from peer > ipv4:127.0.0.1:16001","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":" message from FlightError is here>. Detail: Cancelled","grpc_status":1}. > Client context: OK. Detail: Cancelled > {code} > Note: skimming through the code, it seems this problem is not unique to > PyArrow. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions
[ https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539044#comment-17539044 ] Yaron Gvili commented on ARROW-16211: - Another alternative to consider is registering Python UDFs to an extension registry instance that (1) is specific to the Python interpreter and (2) is linked to the default global one (so it can find both UDF and normal functions). This Python-specific registry would then be passed to be used by the execution engine. I think this way (only) the Python-specific registry would naturally get cleaned up on finalization of the Python interpreter.
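The linked-registry idea in this comment can be sketched in plain Python. This is only a toy model under stated assumptions: the class and method names (`FunctionRegistry`, `add_function`, `get_function`, `parent`) are illustrative and are not Arrow's C++ or Python API. It shows a nested registry that resolves names locally first and then falls back to a longer-lived parent, so dropping the nested registry discards only the UDFs.

```python
class FunctionRegistry:
    """Toy registry; names are illustrative, not Arrow's real API."""

    def __init__(self, parent=None):
        self._functions = {}
        self._parent = parent  # optional link to a longer-lived registry

    def add_function(self, name, func):
        if name in self._functions:
            raise KeyError(f"already registered: {name}")
        self._functions[name] = func

    def get_function(self, name):
        # Look locally first, then fall back along the parent chain.
        if name in self._functions:
            return self._functions[name]
        if self._parent is not None:
            return self._parent.get_function(name)
        raise KeyError(f"no function named {name}")


# Stand-in for the default global registry holding the normal functions.
global_registry = FunctionRegistry()
global_registry.add_function("add", lambda a, b: a + b)

# One nested registry per interpreter (or per engine invocation).
session_registry = FunctionRegistry(parent=global_registry)
session_registry.add_function("my_udf", lambda x: x * 2)

print(session_registry.get_function("add")(1, 2))   # found via the parent
print(session_registry.get_function("my_udf")(21))  # found locally
```

Because `my_udf` lives only in `session_registry`, letting that object go out of scope removes the UDF without touching the global registry, which is the cleanup property the comment is after.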
[jira] [Commented] (ARROW-16605) [CI][R] Fix revdep Crossbow job
[ https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539021#comment-17539021 ] Jonathan Keane commented on ARROW-16605: Thanks for this! One thing I found while running these is that the {targets} package does not behave well when we use multiple workers when running the revdepchecks. It would be awesome if we could run that one with one worker and then all the others with multiple. > [CI][R] Fix revdep Crossbow job > --- > > Key: ARROW-16605 > URL: https://issues.apache.org/jira/browse/ARROW-16605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Assignee: Jacob Wujciak-Jens >Priority: Blocker > Fix For: 9.0.0 > > > The revdep Crossbow job is currently not functioning correctly. This led to > changed behaviour affecting a revdep with the 8.0.0 release, requiring a > patch after initial submission. > cc: [~jonkeane] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16610) [Python] Raise an error for conflicting options in pq.write_to_dataset
Alenka Frim created ARROW-16610: --- Summary: [Python] Raise an error for conflicting options in pq.write_to_dataset Key: ARROW-16610 URL: https://issues.apache.org/jira/browse/ARROW-16610 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Alenka Frim Assignee: Alenka Frim Fix For: 9.0.0 A follow-up for https://issues.apache.org/jira/browse/ARROW-16420 : For some of the 'conflicting' options, for instance if the user passes both 'partitioning' and 'partition_cols', or 'metadata_collector' and 'file_visitor', an error should be raised. See: [https://github.com/apache/arrow/pull/13062#pullrequestreview-966014225] -- This message was sent by Atlassian Jira (v8.20.7#820007)
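A minimal sketch of the kind of check this issue asks for, assuming a hypothetical helper named `check_conflicting_options`; it is not the code from the linked pull request. The two option pairs are the ones named in the description above, and an option counts as "passed" here when its value is not `None`.

```python
def check_conflicting_options(**options):
    """Raise ValueError if two mutually exclusive options are both set.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    conflicts = [
        ("partitioning", "partition_cols"),
        ("metadata_collector", "file_visitor"),
    ]
    for first, second in conflicts:
        if options.get(first) is not None and options.get(second) is not None:
            raise ValueError(
                f"The '{first}' and '{second}' options cannot be used together"
            )


# A single option from a pair passes silently.
check_conflicting_options(partition_cols=["year"])

# Both options from a pair raise an error.
try:
    check_conflicting_options(partitioning=["year"], partition_cols=["year"])
except ValueError as exc:
    print(exc)
```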
[jira] [Created] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++
Alenka Frim created ARROW-16609: --- Summary: [C++] xxhash not installed into dist/lib/include when building C++ Key: ARROW-16609 URL: https://issues.apache.org/jira/browse/ARROW-16609 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Alenka Frim Fix For: 9.0.0 My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module was installed was in November 2021. Because {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} -> {{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}}, this module is needed to try to build the Python C++ API separately from C++ (https://issues.apache.org/jira/browse/ARROW-16340).
[jira] [Resolved] (ARROW-16427) [Java] jdbcToArrowVectors / sqlToArrowVectorIterator fails to handle variable decimal precision / scale
[ https://issues.apache.org/jira/browse/ARROW-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-16427. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13166 [https://github.com/apache/arrow/pull/13166] > [Java] jdbcToArrowVectors / sqlToArrowVectorIterator fails to handle variable > decimal precision / scale > --- > > Key: ARROW-16427 > URL: https://issues.apache.org/jira/browse/ARROW-16427 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 7.0.0 >Reporter: Jonathan Swenson >Assignee: Todd Farmer >Priority: Major > Labels: JDBC, Java, pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > When a JDBC driver returns a Numeric type that doesn't exactly align with > what is in the JDBC metadata, jdbcToArrowVectors / sqlToArrowVectorIterator > fails to process the result (failing on serializing the value into the > BigDecimalVector). > It appears as though this is because JDBC drivers can return BigDecimal / > Numeric values whose precision / scale differ from the metadata and are not > consistent across rows. > Is there a recommended course of action to represent a variable precision / > scale decimal vector? In any case it does not seem possible to convert JDBC > data with the built-in utilities when these numeric types come > in this form. > It seems like both the Oracle and the Postgres JDBC drivers also return > metadata with a 0,0 precision / scale when values in the result set have > different (and varied) precision / scale. > An example: > Against postgres, running a simple SQL query that produces numeric types can > lead to a JDBC result set with BigDecimal values with variable decimal > precision/scale. 
> {code:java} > SELECT value FROM ( > SELECT 1000.01 AS "value" > UNION SELECT 10300.001 > ) a {code} > > The postgres JDBC adapter produces a result set that looks like the > following: > > || ||value||precision||scale|| > |metadata|N/A|0|0| > |row 1|1000.01|18|2| > |row 2|10300.001|20|7| > > Even a result set that returns a single value may contain Numeric values with > precision / scale that do not match the precision / scale in the > ResultSetMetadata. > > {code:java} > SELECT AVG(one) from ( > SELECT 1000.01 as "one" > UNION select 10300.001 > ) a {code} > produces a result set that looks like this > > || ||value||precision||scale|| > |metadata|N/A|0|0| > |row 1|5005150.0050001|22|7| > > When processing the result set using the simple jdbcToArrowVectors (or > sqlToArrowVectorIterator) this fails to set the values extracted from the > result set into the DecimalVector > > {code:java} > val calendar = JdbcToArrowUtils.getUtcCalendar() > val schema = JdbcToArrowUtils.jdbcToArrowSchema(rs.metaData, calendar) > val root = VectorSchemaRoot.create(schema, RootAllocator()) > val vectors = JdbcToArrowUtils.jdbcToArrowVectors(rs, root, calendar) {code} > Error: > > {code:java} > Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, > length: 1 (expected: range(0, 0)) > at org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:318) > at org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:305) > at org.apache.arrow.memory.ArrowBuf.getByte(ArrowBuf.java:507) > at org.apache.arrow.vector.BitVectorHelper.setBit(BitVectorHelper.java:85) > at org.apache.arrow.vector.DecimalVector.set(DecimalVector.java:354) > at > org.apache.arrow.adapter.jdbc.consumer.DecimalConsumer$NullableDecimalConsumer.consume(DecimalConsumer.java:61) > at > org.apache.arrow.adapter.jdbc.consumer.CompositeJdbcConsumer.consume(CompositeJdbcConsumer.java:46) > at > org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:369) > at > 
org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:321) > {code} > > using `sqlToArrowVectorIterator` also fails with an error trying to set data > into the vector: (requires a little bit of trickery to force creation of the > package private configuration) > > {code:java} > Exception in thread "main" java.lang.RuntimeException: Error occurred while > getting next schema root. > at > org.apache.arrow.adapter.jdbc.ArrowVectorIterator.next(ArrowVectorIterator.java:179) > at > com.acme.dataformat.ArrowResultSetProcessor.processResultSet(ArrowResultSetProcessor.kt:31) > at com.acme.AppKt.main(App.kt:54) > at com.acme.AppKt.main(App.kt) > Caused by: java.lang.RuntimeException: Error occurred while
[jira] [Created] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch
Jonathan Swenson created ARROW-16608: Summary: [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch Key: ARROW-16608 URL: https://issues.apache.org/jira/browse/ARROW-16608 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva, Java Affects Versions: 8.0.0 Reporter: Jonathan Swenson Potentially a blocker for Arrow integration into Calcite (CALCITE-2040); however, it may be possible to move forward without M1 Mac support. Potentially related to ARROW-11135. Getting an instance of the JniLoader throws an UnsatisfiedLinkError when it tries to load the libgandiva_jni.dylib that it has extracted from the jar into a temporary directory. Simplified error: {code:java} Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp_dir/libgandiva_jni.dylib_uuid: dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried: '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e')){code} Full error and stack trace: {code:java} Exception in thread "main" java.lang.UnsatisfiedLinkError: /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe: dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe, 0x0001): tried: '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e')) at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832) at java.lang.Runtime.load0(Runtime.java:811) at java.lang.System.load(System.java:1088) at org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74) at org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63) at 
org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53) at org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144) at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67) at io.acme.Main.main(Main.java:26) {code} This example loads three libraries from Maven Central using Gradle: {code:java} repositories { mavenCentral() } dependencies { implementation("org.apache.arrow:arrow-memory-netty:8.0.0") implementation("org.apache.arrow:arrow-vector:8.0.0") implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0") } {code} Example code: {code:java} public class Main { public static void main(String[] args) throws GandivaException { Field field = new Field("int_field", FieldType.nullable(new ArrowType.Int(32, true)), null); Schema schema = makeSchema(field); Condition condition = makeCondition(field); Filter.make(schema, condition); } private static Schema makeSchema(Field field) { List<Field> fieldList = new ArrayList<>(); fieldList.add(field); return new Schema(fieldList, null); } private static Condition makeCondition(Field f) { List<TreeNode> treeNodes = new ArrayList<>(2); treeNodes.add(TreeBuilder.makeField(f)); treeNodes.add(TreeBuilder.makeLiteral(4)); TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, new ArrowType.Bool()); return TreeBuilder.makeCondition(comparison); } } {code} While I haven't tested this exact example, a similar example executes without issue on an Intel x86 Mac.
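The failure above comes down to an architecture mismatch between the bundled native library (x86_64) and the host process (arm64e). A small Python sketch of that comparison follows; it is only an illustration, not how JniLoader works. The helper names `arch_family` and `can_load` are hypothetical, and grouping `arm64`, `arm64e` and `aarch64` into one family is a simplifying assumption.

```python
import platform

# Assumed grouping of architecture names into families (simplification).
_ARCH_FAMILIES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "arm64": "arm64",
    "arm64e": "arm64",
    "aarch64": "arm64",
}


def arch_family(name):
    """Map an architecture name to a coarse family; unknown names pass through."""
    key = name.lower()
    return _ARCH_FAMILIES.get(key, key)


def can_load(library_arch, host_arch=None):
    """Return True if a library built for library_arch matches the host family."""
    host = host_arch if host_arch is not None else platform.machine()
    return arch_family(library_arch) == arch_family(host)


# The combination reported in the error message: have 'x86_64', need 'arm64e'.
print(can_load("x86_64", "arm64e"))
```

A check along these lines, run before extracting and loading the library, would let a loader report the mismatch directly instead of surfacing it as an UnsatisfiedLinkError from dlopen.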
[jira] [Updated] (ARROW-16592) [FlightRPC][Python] Regression in DoPut error handling
[ https://issues.apache.org/jira/browse/ARROW-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-16592: - Component/s: FlightRPC Python
[jira] [Updated] (ARROW-16415) [R] Update strptime bindings to use tz
[ https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16415: --- Labels: pull-request-available (was: ) > [R] Update strptime bindings to use tz > --- > > Key: ARROW-16415 > URL: https://issues.apache.org/jira/browse/ARROW-16415 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The {{strptime}} binding mentions that it does not support {{tz}}, the timezone argument. > ARROW-12820 has been addressed and the binding definition needs updating.
[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16521: --- Labels: good-first-issue good-second-issue (was: ) > [C++][R] Configure curl timeout policy for S3 > - > > Key: ARROW-16521 > URL: https://issues.apache.org/jira/browse/ARROW-16521 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python, R >Affects Versions: 7.0.0 >Reporter: Carl Boettiger >Priority: Major > Labels: good-first-issue, good-second-issue > Fix For: 9.0.0 > > > Is it possible for the user to increase the timeout allowed on the curl > settings when accessing S3 records? The default setting appears to be more > aggressive than most other S3 clients I use, which means that I see a lot > more failures on arrow-based operations than the other clients see. I'm not > seeing how this can be increased though? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-16521: -- Assignee: (was: Antoine Pitrou)
[jira] [Commented] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538949#comment-17538949 ] Antoine Pitrou commented on ARROW-16521: We should make the timeout configurable from the API without having to set a CURL environment variable. Also, it looks like the AWS SDK's own {{DefaultRetryStrategy}} has a default timeout of around 25 seconds, so we should also use that by default.
[jira] [Assigned] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-16521: -- Assignee: Antoine Pitrou
[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16521: --- Component/s: Python
[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16521: --- Component/s: C++
[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3
[ https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-16521: --- Fix Version/s: 9.0.0
[jira] [Commented] (ARROW-16161) [C++] Overhead of std::shared_ptr<DataType> copies is causing thread contention
[ https://issues.apache.org/jira/browse/ARROW-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538927#comment-17538927 ] Tobias Zagorni commented on ARROW-16161: I created a PR for avoiding calls to Slice() as ARROW-16562 > [C++] Overhead of std::shared_ptr<DataType> copies is causing thread > contention > --- > > Key: ARROW-16161 > URL: https://issues.apache.org/jira/browse/ARROW-16161 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Weston Pace >Assignee: Tobias Zagorni >Priority: Major > Attachments: ExecArrayData-difference.txt > > > We created a benchmark to measure ExecuteScalarExpression performance in > ARROW-16014. We noticed significant thread contention (even though there > shouldn't be much, if any, for this task). As part of ARROW-16138 we have been > investigating possible causes. > One cause seems to be contention from copying shared_ptr<DataType> objects. > Two possible solutions jump to mind and I'm sure there are many more. > ExecBatch is an internal type and used inside of ExecuteScalarExpression as > well as inside of the execution engine. In the former we can safely assume > the data types will exist for the duration of the call. In the latter we can > safely assume the data types will exist for the duration of the execution > plan. Thus we can probably take a more targeted fix and migrate only > ExecBatch to using DataType* (or const DataType&). > On the other hand, we might consider a more global approach. All of our > "stock" data types are assumed to have static storage duration. However, we > must use std::shared_ptr<DataType> because users could create their own > extension types. We could invent an "extension type registration" system > where extension types must first be registered with the C++ lib before being > used. Then we could have long-lived DataType instances and we could replace > std::shared_ptr<DataType> with DataType* (or const DataType&) throughout most > of the code base. 
> But, as I mentioned, I'm sure there are many approaches to take. CC > [~lidavidm] and [~apitrou] and [~yibocai] for thoughts but this might be > interesting for just about any C++ dev. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16415) [R] Update strptime bindings to use tz
[ https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld reassigned ARROW-16415: Assignee: Dragoș Moldovan-Grünfeld
[jira] [Commented] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'
[ https://issues.apache.org/jira/browse/ARROW-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538883#comment-17538883 ] David Li commented on ARROW-16606: -- Thanks for the report. I'll try to reproduce it soon but yes, headers should absolutely be case insensitive and we shouldn't be crashing in either case. > [FlightRPC][Python] Flight RPC crashes when a middleware sends an > authorization header written with an upper-case A as in 'Authorization' > -- > > Key: ARROW-16606 > URL: https://issues.apache.org/jira/browse/ARROW-16606 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Python >Affects Versions: 7.0.0, 8.0.0 > Environment: Python 3.9.12 on macOS 12.3.1 >Reporter: Paul Horn >Assignee: David Li >Priority: Major > > Sending a custom `Authorization` header leads to a crash of the client > > Running this python code, for example > > {code:java} > import pyarrow.flight as flight > class TestMiddlewareFactory(ClientMiddlewareFactory): > def __init__(self, *args, **kwargs): > super().__init__(*args, **kwargs) > def start_call(self, info): > return TestMiddleware() > class TestMiddleware(ClientMiddleware): > def __init__(self, *args, **kwargs): > super().__init__(*args, **kwargs) > def sending_headers(self): > return {"Authorization": "Basic dXNlcjpwYXNz"} > def test(): > client = flight.FlightClient("grpc://localhost:8491", > middleware=[TestMiddlewareFactory()]) > client.do_get(flight.Ticket("")) > {code} > > > Results in > > > {noformat} > tests/rpc_repro.py Fatal Python error: AbortedCurrent thread > 0x000202ecc600 (most recent call first): > File "tests/rpc_repro.py", line 22 in test > File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in > pytest_pyfunc_call > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__ > File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in > runtest > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in > pytest_runtest_call > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in > > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in > from_call > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in > call_runtest_hook > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in > call_and_report > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in > runtestprotocol > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in > pytest_runtest_protocol > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in > pytest_runtestloop > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in > wrap_session > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in > pytest_cmdline_main > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > 
_hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line > 162 in main > File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line > 185 in console_main > File "venv/bin/pytest", line 8 in > Abort trap: 6 {noformat} > > > With an additional crash report from the OS > > {noformat} > Process: Python [26728] > Path: >
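Until the crash itself is fixed, one possible client-side workaround is to lower-case header names before returning them from `sending_headers`. HTTP/2, which gRPC runs on, requires field names to be sent in lower case, so a lower-cased key should be safe. The sketch below uses only the standard library; `lowercase_header_keys` is a hypothetical helper, not part of pyarrow, and it assumes the abort is triggered solely by the upper-case key, which this thread has not yet confirmed.

```python
from typing import Dict, Mapping


def lowercase_header_keys(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `headers` with every key lower-cased.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    normalized: Dict[str, str] = {}
    for key, value in headers.items():
        lowered = key.lower()
        # Refuse silently merging two keys that differ only by case.
        if lowered in normalized:
            raise ValueError(f"duplicate header after lower-casing: {lowered!r}")
        normalized[lowered] = value
    return normalized


print(lowercase_header_keys({"Authorization": "Basic dXNlcjpwYXNz"}))
```

In the reproducer above, `sending_headers` could then return `lowercase_header_keys({"Authorization": ...})` instead of the raw dictionary.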
[jira] [Assigned] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'
[ https://issues.apache.org/jira/browse/ARROW-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-16606: Assignee: David Li > [FlightRPC][Python] Flight RPC crashes when a middleware sends an > authorization header written with an upper-case A as in 'Authorization' > -- > > Key: ARROW-16606 > URL: https://issues.apache.org/jira/browse/ARROW-16606 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Python >Affects Versions: 7.0.0, 8.0.0 > Environment: Python 3.9.12 on macOS 12.3.1 >Reporter: Paul Horn >Assignee: David Li >Priority: Major > > Sending a custom `Authorization` header leads to a crash of the client > > Running this python code, for example > > {code:java} > import pyarrow.flight as flight > class TestMiddlewareFactory(ClientMiddlewareFactory): > def __init__(self, *args, **kwargs): > super().__init__(*args, **kwargs) > def start_call(self, info): > return TestMiddleware() > class TestMiddleware(ClientMiddleware): > def __init__(self, *args, **kwargs): > super().__init__(*args, **kwargs) > def sending_headers(self): > return {"Authorization": "Basic dXNlcjpwYXNz"} > def test(): > client = flight.FlightClient("grpc://localhost:8491", > middleware=[TestMiddlewareFactory()]) > client.do_get(flight.Ticket("")) > {code} > > > Results in > > > {noformat} > tests/rpc_repro.py Fatal Python error: AbortedCurrent thread > 0x000202ecc600 (most recent call first): > File "tests/rpc_repro.py", line 22 in test > File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in > pytest_pyfunc_call > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in > runtest > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in 
> pytest_runtest_call > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in > > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in > from_call > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in > call_runtest_hook > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in > call_and_report > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in > runtestprotocol > File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in > pytest_runtest_protocol > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in > pytest_runtestloop > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in > wrap_session > File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in > pytest_cmdline_main > File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in > _multicall > File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in > _hookexec > File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in > __call__ > File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line > 
162 in main > File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line > 185 in console_main > File "venv/bin/pytest", line 8 in > Abort trap: 6 {noformat} > > > With an additional crash report from the OS > > {noformat} > Process: Python [26728] > Path: > /usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/Python > Identifier: org.python.python > Version: 3.9.12 (3.9.12) > Code Type:
[jira] [Updated] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'
[ https://issues.apache.org/jira/browse/ARROW-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Horn updated ARROW-16606: -- Description: Sending a custom `Authorization` header leads to a crash of the client Running this python code, for example {code:java} import pyarrow.flight as flight class TestMiddlewareFactory(ClientMiddlewareFactory): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def start_call(self, info): return TestMiddleware() class TestMiddleware(ClientMiddleware): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def sending_headers(self): return {"Authorization": "Basic dXNlcjpwYXNz"} def test(): client = flight.FlightClient("grpc://localhost:8491", middleware=[TestMiddlewareFactory()]) client.do_get(flight.Ticket("")) {code} Results in {noformat} tests/rpc_repro.py Fatal Python error: AbortedCurrent thread 0x000202ecc600 (most recent call first): File "tests/rpc_repro.py", line 22 in test File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in runtest File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in from_call File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in call_runtest_hook File 
"venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in call_and_report File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in runtestprotocol File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in pytest_runtestloop File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in wrap_session File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 162 in main File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 185 in console_main File "venv/bin/pytest", line 8 in Abort trap: 6 {noformat} With an additional crash report from the OS {noformat} Process: Python [26728] Path: /usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/Python Identifier: org.python.python Version: 3.9.12 (3.9.12) Code Type: X86-64 (Translated) Parent Process: bash [4683] Responsible: iTerm2 [99236] User ID: 501Date/Time: 2022-05-18 15:35:10.1978 +0200 OS Version: macOS 12.3.1 (21E258) 
Report Version: 12 Anonymous UUID: 4A72633D-06AC-F2CE-0E3F-0AD87FA611CESleep/Wake UUID: 3D7BD416-99A9-41B3-8163-5544AEF31FF5Time Awake Since Boot: 100 seconds Time Since Wake: 22827 secondsSystem Integrity Protection: enabledCrashed Thread: 0 Dispatch queue: com.apple.main-threadException Type: EXC_CRASH (SIGABRT) Exception Codes: 0x, 0x Exception Note: EXC_CORPSE_NOTIFYApplication Specific Information: abort() called Thread 0 Crashed:: Dispatch queue: com.apple.main-thread 0 ??? 0x7ff8a597e940 ??? 1 libsystem_kernel.dylib
[jira] [Created] (ARROW-16607) [R] Improve KeyValueMetadata handling
Neal Richardson created ARROW-16607: --- Summary: [R] Improve KeyValueMetadata handling Key: ARROW-16607 URL: https://issues.apache.org/jira/browse/ARROW-16607 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 9.0.0 Followup to ARROW-15271. Among the objectives: * Push KVM handling in ExecPlan so that Run() preserves the R metadata we want; also remove the duplicate handling of it for Write() * Better encapsulate KVM for the $metadata and $r_metadata so that as a user/developer, you never have to touch the serialize/deserialize functions, you just have a list to work with * Factor out a common utility in r/src for taking cpp11::strings (named character vector) and producing arrow::KeyValueMetadata -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'
Paul Horn created ARROW-16606: - Summary: [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization' Key: ARROW-16606 URL: https://issues.apache.org/jira/browse/ARROW-16606 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Python Affects Versions: 8.0.0, 7.0.0 Environment: Python 3.9.12 on macOS 12.3.1 Reporter: Paul Horn Sending a custom `Authorization` header leads to a crash of the client Running this python code, for example {code:java} import pyarrow.flight as flight class TestMiddlewareFactory(ClientMiddlewareFactory): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def start_call(self, info): return TestMiddleware() class TestMiddleware(ClientMiddleware): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) def sending_headers(self): return {"Authorization": "Basic dXNlcjpwYXNz"} def test(): client = flight.FlightClient("grpc://localhost:8491", middleware=[TestMiddlewareFactory()]) client.do_get(flight.Ticket("")) {code} Results in {noformat} tests/rpc_repro.py Fatal Python error: AbortedCurrent thread 0x000202ecc600 (most recent call first): File "tests/rpc_repro.py", line 22 in test File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in runtest File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File 
"venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in from_call File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in call_runtest_hook File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in call_and_report File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in runtestprotocol File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in pytest_runtestloop File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in wrap_session File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__ File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 162 in main File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 185 in console_main File "venv/bin/pytest", line 8 in Abort trap: 6 {noformat} With an additional crash report from the OS {noformat} Process: Python [26728]Path: 
/usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/PythonIdentifier: org.python.pythonVersion: 3.9.12 (3.9.12)Code Type: X86-64 (Translated)Parent Process: bash [4683]Responsible: iTerm2 [99236]User ID: 501 Date/Time: 2022-05-18 15:35:10.1978 +0200OS Version: macOS 12.3.1 (21E258)Report Version: 12Anonymous UUID: 4A72633D-06AC-F2CE-0E3F-0AD87FA611CE Sleep/Wake UUID: 3D7BD416-99A9-41B3-8163-5544AEF31FF5 Time Awake Since Boot: 100 secondsTime Since Wake: 22827 seconds System Integrity Protection: enabled Crashed Thread: 0 Dispatch queue: com.apple.main-thread
[jira] [Updated] (ARROW-16415) [R] Update strptime bindings to use tz
[ https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-16415: - Description: {{strptime}} mentions it does not support {{tz}} - the timezone argument. ARROW-12820 has been addressed and the binding definition needs updating. (was: Both functions mention they do not support {{tz}} - the timezone argument. ARROW-12820 has been addressed and the bindings definitions need updating.) > [R] Update strptime bindings to use tz > --- > > Key: ARROW-16415 > URL: https://issues.apache.org/jira/browse/ARROW-16415 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > {{strptime}} mentions it does not support {{tz}} - the timezone argument. > ARROW-12820 has been addressed and the binding definition needs updating. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16415) [R] Update strptime bindings to use tz
[ https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-16415: - Summary: [R] Update strptime bindings to use tz (was: [R] Update strptime and fast_strptime bindings to use tz ) > [R] Update strptime bindings to use tz > --- > > Key: ARROW-16415 > URL: https://issues.apache.org/jira/browse/ARROW-16415 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Both functions mention they do not support {{tz}} - the timezone argument. > ARROW-12820 has been addressed and the bindings definitions need updating. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16317) [Archery][CI] Fix possible race condition when submitting crossbow builds
[ https://issues.apache.org/jira/browse/ARROW-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16317: --- Labels: pull-request-available (was: ) > [Archery][CI] Fix possible race condition when submitting crossbow builds > - > > Key: ARROW-16317 > URL: https://issues.apache.org/jira/browse/ARROW-16317 > Project: Apache Arrow > Issue Type: Bug > Components: Archery, Continuous Integration >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Sometimes, when trying to use GitHub Actions to submit crossbow jobs, an error > like the following is raised: > {code:java} > Failed to push updated references, potentially because of credential issues: > ['refs/heads/actions-1883-github-wheel-windows-cp310-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp310-amd64', > 'refs/heads/actions-1883-github-wheel-windows-cp39-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp39-amd64', > 'refs/heads/actions-1883-github-wheel-windows-cp37-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp37-amd64', > 'refs/heads/actions-1883-github-wheel-windows-cp38-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp38-amd64', > 'refs/heads/actions-1883'] > The Archery job run can be found at: > https://github.com/apache/arrow/actions/runs/2195038965 {code} > As discussed on this GitHub comment > ([https://github.com/apache/arrow/pull/12930#issuecomment-1103772507]) > We should remove the auto-incremented IDs entirely and use unique hashes > instead, e.g.: actions--github-wheel-windows-cp310-amd64 instead > of actions-1883-github-wheel-windows-cp310-amd64. Then we wouldn't need to > fetch the new references either, making remote crossbow builds and local > submission much quicker. 
> The error can also be seen here: > https://github.com/apache/arrow/pull/12987#issuecomment-1108516668 -- This message was sent by Atlassian Jira (v8.20.7#820007)
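The proposal in this issue, replacing the auto-incremented job number with a unique hash, can be sketched as follows. This is only an illustration under stated assumptions, not Archery's implementation: `unique_job_prefix` and `ref_names` are hypothetical names, and a truncated random UUID stands in for whatever hash the real fix would use. Because each submission picks its own identifier locally, two concurrent submissions no longer compete for the same "next" number.

```python
import uuid
from typing import List, Sequence


def unique_job_prefix(prefix: str = "actions", length: int = 10) -> str:
    """Build a job prefix from a random hex token instead of a counter.

    Hypothetical helper; the token length is an arbitrary choice.
    """
    token = uuid.uuid4().hex[:length]
    return f"{prefix}-{token}"


def ref_names(job_prefix: str, tasks: Sequence[str]) -> List[str]:
    """List the branch and tag refs a submission would push for its tasks."""
    refs: List[str] = []
    for task in tasks:
        refs.append(f"refs/heads/{job_prefix}-{task}")
        refs.append(f"refs/tags/{job_prefix}-{task}")
    refs.append(f"refs/heads/{job_prefix}")
    return refs


job = unique_job_prefix()
for ref in ref_names(job, ["github-wheel-windows-cp310-amd64"]):
    print(ref)
```

No fetch of existing references is needed to choose the prefix, which is the speed-up the issue description anticipates.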
[jira] [Assigned] (ARROW-16317) [Archery][CI] Fix possible race condition when submitting crossbow builds
[ https://issues.apache.org/jira/browse/ARROW-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido reassigned ARROW-16317: - Assignee: Raúl Cumplido > [Archery][CI] Fix possible race condition when submitting crossbow builds > - > > Key: ARROW-16317 > URL: https://issues.apache.org/jira/browse/ARROW-16317 > Project: Apache Arrow > Issue Type: Bug > Components: Archery, Continuous Integration >Reporter: Raúl Cumplido >Assignee: Raúl Cumplido >Priority: Major > Fix For: 9.0.0 > > > Sometimes, when trying to use GitHub Actions to submit crossbow jobs, an error > like the following is raised: > {code:java} > Failed to push updated references, potentially because of credential issues: > ['refs/heads/actions-1883-github-wheel-windows-cp310-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp310-amd64', > 'refs/heads/actions-1883-github-wheel-windows-cp39-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp39-amd64', > 'refs/heads/actions-1883-github-wheel-windows-cp37-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp37-amd64', > 'refs/heads/actions-1883-github-wheel-windows-cp38-amd64', > 'refs/tags/actions-1883-github-wheel-windows-cp38-amd64', > 'refs/heads/actions-1883'] > The Archery job run can be found at: > https://github.com/apache/arrow/actions/runs/2195038965 {code} > As discussed on this GitHub comment > ([https://github.com/apache/arrow/pull/12930#issuecomment-1103772507]) > We should remove the auto-incremented IDs entirely and use unique hashes > instead, e.g.: actions--github-wheel-windows-cp310-amd64 instead > of actions-1883-github-wheel-windows-cp310-amd64. Then we wouldn't need to > fetch the new references either, making remote crossbow builds and local > submission much quicker. > The error can also be seen here: > https://github.com/apache/arrow/pull/12987#issuecomment-1108516668 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page
[ https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido reassigned ARROW-16582: - Assignee: Raúl Cumplido > [Python] Include DATASET in list of components in PyArrow's dev page > > > Key: ARROW-16582 > URL: https://issues.apache.org/jira/browse/ARROW-16582 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Python >Reporter: Yaron Gvili >Assignee: Raúl Cumplido >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > PyArrow's dev page has a [build-and-test > section|https://arrow.apache.org/docs/developers/python.html#build-and-test] > that currently does not list DATASET as a component. Using a recent Arrow > version (commit e5e490), I observed DATASET was mandatory for the successful > completion of the test suite run by `python -m pytest pyarrow/`, > as recommended on the page. Without `export > PYARROW_WITH_DATASET=1`, I observed errors with `test_dataset.py`, > `test_exec_plan.py`, and a couple of others. > Since DATASET is intended to be an optional component, it should be listed in > this section. In addition, the documented test suite command should be > updated to one that doesn't fail without the DATASET component being selected > (or else the test suite itself should be fixed). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page
[ https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16582: --- Labels: pull-request-available (was: ) > [Python] Include DATASET in list of components in PyArrow's dev page > > > Key: ARROW-16582 > URL: https://issues.apache.org/jira/browse/ARROW-16582 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Python >Reporter: Yaron Gvili >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > PyArrow's dev page has a [build-and-test > section|https://arrow.apache.org/docs/developers/python.html#build-and-test] > that currently does not list DATASET as a component. Using a recent Arrow > version (commit e5e490), I observed DATASET was mandatory for the successful > completion of the test suite run by `python -m pytest pyarrow/`, > as recommended on the page. Without `export > PYARROW_WITH_DATASET=1`, I observed errors with `test_dataset.py`, > `test_exec_plan.py`, and a couple of others. > Since DATASET is intended to be an optional component, it should be listed in > this section. In addition, the documented test suite command should be > updated to one that doesn't fail without the DATASET component being selected > (or else the test suite itself should be fixed). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16605) [CI][R] Fix revdep Crossbow job
Jacob Wujciak-Jens created ARROW-16605: -- Summary: [CI][R] Fix revdep Crossbow job Key: ARROW-16605 URL: https://issues.apache.org/jira/browse/ARROW-16605 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jacob Wujciak-Jens Assignee: Jacob Wujciak-Jens Fix For: 9.0.0 The revdep Crossbow job is currently not functioning correctly. This led to changed behaviour affecting a revdep with the 8.0.0 release, requiring a patch after initial submission. cc: [~jonkeane] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16604) [C++] Boost not included when build benchmarks
Yibo Cai created ARROW-16604:

Summary: [C++] Boost not included when build benchmarks
Key: ARROW-16604
URL: https://issues.apache.org/jira/browse/ARROW-16604
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Yibo Cai

{code:bash}
cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
{code}
fails with many Boost-related errors, such as the one below:
{code:bash}
CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
  Target "arrow-json-parser-benchmark" links to target "Boost::system" but
  the target was not found. Perhaps a find_package() call is missing for an
  IMPORTED target, or an ALIAS target is missing?
Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:114 (add_benchmark)
  src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
{code}
The error goes away if tests are also built ({{-DARROW_BUILD_TESTS=ON}}). It looks like Boost is not included when only benchmarks are built.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-14314) [C++] Sorting dictionary array not implemented
[ https://issues.apache.org/jira/browse/ARROW-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538702#comment-17538702 ] Antoine Pitrou commented on ARROW-14314: bq. But to avoid that, we can replace null values into indices, so the problem will look like this: We can indeed. Another possibility is to partition nulls away first, then work on non-null values (partitioning is how the sorting implementation already deals with null values for other data types). That might be a bit faster as well. bq. btw, why do we allow nulls in values? Shouldn't it be easier to have them only in indices? Probably for compatibility with various data sources. > [C++] Sorting dictionary array not implemented > -- > > Key: ARROW-14314 > URL: https://issues.apache.org/jira/browse/ARROW-14314 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Priority: Major > Labels: kernel > Fix For: 9.0.0 > > > From R, taking the stock {{mtcars}} dataset and giving it a dictionary type > column: > {code} > mtcars %>% > mutate(cyl = as.factor(cyl)) %>% > Table$create() %>% > arrange(cyl) %>% > collect() > Error: Type error: Sorting not supported for type dictionary indices=int8, ordered=0> > ../src/arrow/compute/kernels/vector_array_sort.cc:427 VisitTypeInline(type, > this) > ../src/arrow/compute/kernels/vector_sort.cc:148 > GetArraySorter(*physical_type_) > ../src/arrow/compute/kernels/vector_sort.cc:1206 sorter.Sort() > ../src/arrow/compute/api_vector.cc:259 CallFunction("sort_indices", {datum}, > , ctx) > ../src/arrow/compute/exec/order_by_impl.cc:53 SortIndices(table, options_, > ctx_) > ../src/arrow/compute/exec/sink_node.cc:292 impl_->DoFinish() > ../src/arrow/compute/exec/exec_plan.cc:297 iterator_.Next() > ../src/arrow/record_batch.cc:318 ReadNext() > ../src/arrow/record_batch.cc:329 ReadAll() > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
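The "partition nulls away first, then work on non-null values" idea from the comment above can be sketched in plain Python. This is a toy model and not Arrow's kernel: the dictionary array is represented as a list of indices plus a list of dictionary values, `None` stands for a null in either place, nulls are placed last, and `sort_indices_dictionary` is a hypothetical name.

```python
from typing import List, Optional, Sequence


def sort_indices_dictionary(
    indices: Sequence[Optional[int]],
    dictionary: Sequence[Optional[str]],
) -> List[int]:
    """Return the positions that sort a dictionary-encoded column, nulls last."""

    def is_null(pos: int) -> bool:
        idx = indices[pos]
        # A slot is null if the index is null or it points at a null value.
        return idx is None or dictionary[idx] is None

    positions = range(len(indices))
    # Step 1: stable partition of null slots away from non-null slots.
    non_null = [p for p in positions if not is_null(p)]
    nulls = [p for p in positions if is_null(p)]

    # Step 2: rank each distinct dictionary value once, then sort the
    # non-null slots by the rank of the value they refer to.
    distinct = sorted({v for v in dictionary if v is not None})
    rank = {value: r for r, value in enumerate(distinct)}
    non_null.sort(key=lambda p: rank[dictionary[indices[p]]])
    return non_null + nulls


print(sort_indices_dictionary([2, None, 0, 1, 0], ["b", None, "a"]))
```

Ranking the dictionary values once keeps the per-row comparison to an integer lookup, which is the main attraction of sorting dictionary arrays through their indices.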
[jira] [Resolved] (ARROW-16516) [R] Implement ym() my() and yq() parsers
[ https://issues.apache.org/jira/browse/ARROW-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicola Crane resolved ARROW-16516.
----------------------------------
    Resolution: Fixed

Issue resolved by pull request 13163
[https://github.com/apache/arrow/pull/13163]

> [R] Implement ym() my() and yq() parsers
> ----------------------------------------
>
>                 Key: ARROW-16516
>                 URL: https://issues.apache.org/jira/browse/ARROW-16516
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Dragoș Moldovan-Grünfeld
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
[jira] [Created] (ARROW-16603) [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options
Alenka Frim created ARROW-16603:
-----------------------------------

             Summary: [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options
                 Key: ARROW-16603
                 URL: https://issues.apache.org/jira/browse/ARROW-16603
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Alenka Frim


Reproducible example:

{code:python}
import json
import pyarrow.json as pj
import pyarrow as pa

# Write a single JSON record to disk.
s = {"id": "value", "nested": {"value": 1}}
with open("issue.json", "w") as write_file:
    json.dump(s, write_file, indent=4)

# Declare "id" and the nested "value" field as non-nullable.
schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("nested", pa.struct([pa.field("value", pa.int64(), nullable=False)]))
])

table = pj.read_json('issue.json', parse_options=pj.ParseOptions(explicit_schema=schema))

# Print the requested schema and the schema of the table that was read.
print(schema)
print(table.schema)
{code}
[jira] [Commented] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page
[ https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538603#comment-17538603 ]

Yaron Gvili commented on ARROW-16582:
-------------------------------------

Another possible fix is for the build to automatically select DATASET if some other component, like PARQUET, is selected.

> [Python] Include DATASET in list of components in PyArrow's dev page
> --------------------------------------------------------------------
>
>                 Key: ARROW-16582
>                 URL: https://issues.apache.org/jira/browse/ARROW-16582
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, Python
>            Reporter: Yaron Gvili
>            Priority: Major
>             Fix For: 9.0.0
>
>
> PyArrow's dev page has a [build-and-test
> section|https://arrow.apache.org/docs/developers/python.html#build-and-test]
> that currently does not list DATASET as a component. Using a recent Arrow
> version (commit e5e490), I observed that DATASET was mandatory for the
> successful completion of the test suite run by `python -m pytest pyarrow/`,
> as recommended on the page. Without `export PYARROW_WITH_DATASET=1`, I
> observed errors with `test_dataset.py`, `test_exec_plan.py`, and a couple
> of others.
> Since DATASET is intended to be an optional component, it should be listed in
> this section. In addition, the documented test suite command should be
> updated to one that doesn't fail without the DATASET component being selected
> (or else the test suite itself should be fixed).
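The alternative fix suggested in the comment, having the build select DATASET automatically when another component is selected, might look roughly like the sketch below. It is not taken from PyArrow's build scripts: the `IMPLIES` table, both function names, and the rule that a component is enabled only when its `PYARROW_WITH_<NAME>` variable equals `"1"` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Mapping, Set

# Hypothetical rule table: selecting the key also selects the listed components.
# The PARQUET -> DATASET entry mirrors the example given in the comment only.
IMPLIES: Dict[str, Set[str]] = {"PARQUET": {"DATASET"}}


def components_from_env(env: Mapping[str, str], known: Iterable[str]) -> Set[str]:
    """Collect components whose PYARROW_WITH_<NAME> variable is set to "1".

    Assumption: only the literal value "1" counts as enabled in this sketch.
    """
    return {name for name in known if env.get(f"PYARROW_WITH_{name}") == "1"}


def resolve_components(
    selected: Iterable[str],
    implies: Mapping[str, Set[str]] = IMPLIES,
) -> Set[str]:
    """Expand a selection with every component it implies, transitively."""
    resolved: Set[str] = set()
    pending = list(selected)
    while pending:
        component = pending.pop()
        if component in resolved:
            continue  # already handled; also guards against cyclic rules
        resolved.add(component)
        pending.extend(implies.get(component, ()))
    return resolved


if __name__ == "__main__":
    env = {"PYARROW_WITH_PARQUET": "1"}
    chosen = components_from_env(env, ["PARQUET", "DATASET"])
    print(sorted(resolve_components(chosen)))
```

With a rule table like this, a developer who exports only `PYARROW_WITH_PARQUET=1` would still end up with DATASET selected, so the documented test command would not depend on remembering the extra variable.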