Re: [I] [CI][Python] verify-rc python nightly builds fail due to missing some flags that were activated with ARROW_PYTHON=ON [arrow]

2025-03-01 Thread via GitHub
pareshnagendrapaul-git commented on issue #33092: URL: https://github.com/apache/arrow/issues/33092#issuecomment-2692604510 Missing header file `arrow/csv/api.h` error when building pyarrow from source. Please give a solution. -- This is an automated message from the Apache Git Service. To

Re: [PR] GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations [arrow]

2025-03-01 Thread via GitHub
paleolimbot commented on code in PR #45459: URL: https://github.com/apache/arrow/pull/45459#discussion_r1975909166 ## cpp/src/parquet/types.cc: ## @@ -1619,6 +1668,204 @@ class LogicalType::Impl::Float16 final : public LogicalType::Impl::Incompatible, GENERATE_MAKE(Float16)

Re: [I] Docs for Building from source need to be updated [arrow-java]

2025-03-01 Thread via GitHub
jbonofre commented on issue #640: URL: https://github.com/apache/arrow-java/issues/640#issuecomment-2692584779 I will do a pass.

Re: [PR] feat(csharp/src/Drivers/BigQuery): choose the first project ID if not specified [arrow-adbc]

2025-03-01 Thread via GitHub
davidhcoe commented on code in PR #2541: URL: https://github.com/apache/arrow-adbc/pull/2541#discussion_r1976543777 ## csharp/src/Drivers/BigQuery/BigQueryStatement.cs: ## @@ -51,7 +51,26 @@ public BigQueryStatement(BigQueryClient client, GoogleCredential credential)

Re: [PR] feat(csharp/src/Drivers/BigQuery): choose the first project ID if not specified [arrow-adbc]

2025-03-01 Thread via GitHub
davidhcoe commented on code in PR #2541: URL: https://github.com/apache/arrow-adbc/pull/2541#discussion_r1976543907 ## csharp/test/Apache.Arrow.Adbc.Tests/MultiEnvironmentTestUtils.cs: ## @@ -0,0 +1,90 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] GH-45614: [C++] Use Boost's CMake packages instead of FindBoost.cmake in CMake [arrow]

2025-03-01 Thread via GitHub
kou commented on PR #45623: URL: https://github.com/apache/arrow/pull/45623#issuecomment-2691946763 @github-actions crossbow submit -g wheel

Re: [PR] GH-35289: [Python] Support large variable width types in numpy conversion [arrow]

2025-03-01 Thread via GitHub
rok commented on code in PR #36701: URL: https://github.com/apache/arrow/pull/36701#discussion_r1975761191 ## python/pyarrow/tests/test_array.py: ## @@ -3286,24 +3295,34 @@ def test_numpy_binary_overflow_to_chunked(): unicode_values += [unicode_unique_strings[i % 10]

Re: [PR] [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer [arrow]

2025-03-01 Thread via GitHub
kou commented on code in PR #45360: URL: https://github.com/apache/arrow/pull/45360#discussion_r1976514402 ## cpp/src/parquet/column_chunker.cc: ## @@ -0,0 +1,778 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See t

Re: [I] [R] Handling large data [arrow]

2025-03-01 Thread via GitHub
ag1805x commented on issue #45645: URL: https://github.com/apache/arrow/issues/45645#issuecomment-2692456856 It seems it is trying to load the data into RAM. Memory usage in RStudio goes up to about 200 GB and then crashes. I tried putting a connection to file in to_duckdb() but does not benefi

Re: [PR] [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer [arrow]

2025-03-01 Thread via GitHub
kszucs commented on code in PR #45360: URL: https://github.com/apache/arrow/pull/45360#discussion_r1976493385 ## cpp/src/parquet/column_chunker.cc: ## @@ -0,0 +1,778 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. Se

Re: [PR] [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer [arrow]

2025-03-01 Thread via GitHub
kszucs commented on code in PR #45360: URL: https://github.com/apache/arrow/pull/45360#discussion_r1976493402 ## cpp/src/parquet/column_chunker.cc: ## @@ -0,0 +1,778 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. Se

Re: [PR] [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer [arrow]

2025-03-01 Thread via GitHub
kszucs commented on code in PR #45360: URL: https://github.com/apache/arrow/pull/45360#discussion_r1976493429 ## cpp/src/parquet/column_chunker.cc: ## @@ -0,0 +1,773 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. Se

Re: [I] Release arrow-rs / parquet patch version 54.2.1 (Feb 2025) (HOTFIX) [arrow-rs]

2025-03-01 Thread via GitHub
alamb commented on issue #7209: URL: https://github.com/apache/arrow-rs/issues/7209#issuecomment-2692421114 Here is my summary of what happened here for anyone who is curious: 1. an arrow-rs dependency ('chrono', a Rust date/time library) released a version [`0.4.40`](https://crates.io/cr

Re: [I] [R] Handling large data [arrow]

2025-03-01 Thread via GitHub
amoeba commented on issue #45645: URL: https://github.com/apache/arrow/issues/45645#issuecomment-2692396514 I'd suggest trying to_duckdb before any of your dplyr operations. So something like, ```r ds <- open_dataset(...) ds |> to_duckdb() |> group_by(...) |>

Re: [I] [R] Handling large data [arrow]

2025-03-01 Thread via GitHub
ag1805x commented on issue #45645: URL: https://github.com/apache/arrow/issues/45645#issuecomment-2692390824 Hello @amoeba! Each group, formed by combining `sample_id, gene1, gene2`, will have 10 rows. As I understand it, the `filter()` within `mclapply` is the current bottleneck.

[PR] MINOR: [JS] Bump typedoc from 0.27.6 to 0.27.9 in /js [arrow]

2025-03-01 Thread via GitHub
dependabot[bot] opened a new pull request, #45646: URL: https://github.com/apache/arrow/pull/45646 Bumps [typedoc](https://github.com/TypeStrong/TypeDoc) from 0.27.6 to 0.27.9. Release notes Sourced from [typedoc's releases](https://github.com/TypeStrong/TypeDoc/releases). v0

[PR] MINOR: [JS] Bump ts-jest from 29.1.4 to 29.2.6 in /js [arrow]

2025-03-01 Thread via GitHub
dependabot[bot] opened a new pull request, #45648: URL: https://github.com/apache/arrow/pull/45648 Bumps [ts-jest](https://github.com/kulshekhar/ts-jest) from 29.1.4 to 29.2.6. Release notes Sourced from [ts-jest's releases](https://github.com/kulshekhar/ts-jest/releases). v2

[PR] MINOR: [JS] Bump eslint-plugin-jest from 28.9.0 to 28.11.0 in /js [arrow]

2025-03-01 Thread via GitHub
dependabot[bot] opened a new pull request, #45647: URL: https://github.com/apache/arrow/pull/45647 Bumps [eslint-plugin-jest](https://github.com/jest-community/eslint-plugin-jest) from 28.9.0 to 28.11.0. Release notes Sourced from https://github.com/jest-community/eslint-plugin-je

Re: [I] [R] Handling large data [arrow]

2025-03-01 Thread via GitHub
amoeba commented on issue #45645: URL: https://github.com/apache/arrow/issues/45645#issuecomment-2692304027 Hi @ag1805x, how many rows and cols do you expect that summarize call to yield (i.e., how big are your groups)? One thing you might try would be to take advantage of our seamless conv

[I] The relative path of the test data is incorrect [arrow-rs]

2025-03-01 Thread via GitHub
Ziy1-Tan opened a new issue, #7222: URL: https://github.com/apache/arrow-rs/issues/7222 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** **Describe the solution you'd like** **Describe alternatives you've consider

[PR] Update the relative path of the test data [arrow-rs]

2025-03-01 Thread via GitHub
Ziy1-Tan opened a new pull request, #7221: URL: https://github.com/apache/arrow-rs/pull/7221 # Which issue does this PR close? Closes #. # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes

Re: [PR] GH-45594: [C++][Parquet] POC: Optimize Parquet DecodeArrow in DeltaLengthByteArray [arrow]

2025-03-01 Thread via GitHub
mapleFU commented on code in PR #45622: URL: https://github.com/apache/arrow/pull/45622#discussion_r1976426404 ## cpp/src/arrow/array/builder_binary.h: ## @@ -303,6 +304,77 @@ class BaseBinaryBuilder return Status::OK(); } + Status AppendBinaryWithLengths(std::string_

Re: [PR] GH-45619: [Python] Use f-string instead of string.format [arrow]

2025-03-01 Thread via GitHub
chilin0525 commented on PR #45629: URL: https://github.com/apache/arrow/pull/45629#issuecomment-2692247948 I have already changed all the files under the pyarrow folder. As discussed in https://github.com/apache/arrow/issues/45619, certain scenarios where the template is reused across multi

Re: [PR] GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations [arrow]

2025-03-01 Thread via GitHub
paleolimbot commented on code in PR #45459: URL: https://github.com/apache/arrow/pull/45459#discussion_r1975881392 ## cpp/src/parquet/geometry_util_internal.h: ## @@ -0,0 +1,184 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agr

Re: [PR] GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations [arrow]

2025-03-01 Thread via GitHub
paleolimbot commented on code in PR #45459: URL: https://github.com/apache/arrow/pull/45459#discussion_r1975806009 ## cpp/src/parquet/geometry_statistics_test.cc: ## @@ -0,0 +1,192 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

Re: [PR] GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations [arrow]

2025-03-01 Thread via GitHub
paleolimbot commented on code in PR #45459: URL: https://github.com/apache/arrow/pull/45459#discussion_r1975902982 ## cpp/src/parquet/test_util.h: ## @@ -830,5 +833,124 @@ inline void GenerateData(int num_values, FLBA* out, std::vector* random_fixed_byte_array(num_values, 0,

Re: [PR] GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations [arrow]

2025-03-01 Thread via GitHub
paleolimbot commented on code in PR #45459: URL: https://github.com/apache/arrow/pull/45459#discussion_r1975896373 ## cpp/src/parquet/reader_test.cc: ## @@ -1857,4 +1863,172 @@ TEST(PageIndexReaderTest, ReadFileWithoutPageIndex) { ASSERT_EQ(nullptr, row_group_index_reader);

Re: [PR] Experimental parquet decoder with first-class selection pushdown support [arrow-rs]

2025-03-01 Thread via GitHub
bharath-techie commented on code in PR #6921: URL: https://github.com/apache/arrow-rs/pull/6921#discussion_r1976408013 ## parquet/src/arrow/async_reader/arrow_reader.rs: ## @@ -0,0 +1,574 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor l

Re: [PR] Experimental parquet decoder with first-class selection pushdown support [arrow-rs]

2025-03-01 Thread via GitHub
bharath-techie commented on PR #6921: URL: https://github.com/apache/arrow-rs/pull/6921#issuecomment-2692201208 Hi @XiangpengHao, I'm sure this is still work in progress. But we encountered extremely slow IO pushdown queries when we were doing a POC with datafusion / arrow-rs fo

Re: [I] [Python] Use f-string instead of `string.format` [arrow]

2025-03-01 Thread via GitHub
chilin0525 commented on issue #45619: URL: https://github.com/apache/arrow/issues/45619#issuecomment-2692109777 Hi @kou , I'm still working on the PR to refactor all instances of string.format. However, in some cases, I'm unsure whether they should also be refactored. For example, in

Re: [I] [Python] Use f-string instead of `string.format` [arrow]

2025-03-01 Thread via GitHub
kou commented on issue #45619: URL: https://github.com/apache/arrow/issues/45619#issuecomment-2692110989 We don't need to use f-string for it. If we use f-string for it, we can't reuse the template in multiple methods.

Re: [PR] GH-45269: [C++][Compute] Add "pivot_wider" and "hash_pivot_wider" functions [arrow]

2025-03-01 Thread via GitHub
pitrou commented on code in PR #45562: URL: https://github.com/apache/arrow/pull/45562#discussion_r1976370007 ## cpp/src/arrow/compute/api_aggregate.h: ## @@ -175,6 +175,88 @@ class ARROW_EXPORT TDigestOptions : public FunctionOptions { uint32_t min_count; }; +/// \brief