[jira] [Resolved] (ARROW-16818) [Doc][Python] Document GCS filesystem for PyArrow
[ https://issues.apache.org/jira/browse/ARROW-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim resolved ARROW-16818. - Resolution: Fixed Issue resolved by pull request 13681 [https://github.com/apache/arrow/pull/13681] > [Doc][Python] Document GCS filesystem for PyArrow > - > > Key: ARROW-16818 > URL: https://issues.apache.org/jira/browse/ARROW-16818 > Project: Apache Arrow > Issue Type: Task > Components: Documentation, Python >Reporter: Antoine Pitrou >Assignee: Rok Mihevc >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Followup to ARROW-14892. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17181) [Docs][Python] Scalar UDF Experimental Documentation
[ https://issues.apache.org/jira/browse/ARROW-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17181: --- Labels: pull-request-available (was: ) > [Docs][Python] Scalar UDF Experimental Documentation > > > Key: ARROW-17181 > URL: https://issues.apache.org/jira/browse/ARROW-17181 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation, Python >Affects Versions: 9.0.0 >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > At the moment the existing Scalar UDF usage is not documented. There will be > a final version of documentation update once other features are integrated. > But to support the users and developers, the existing content needs to be > documented. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17159) [C++][JAVA] Dataset: Support reading from fixed offset of a file for Parquet format
[ https://issues.apache.org/jira/browse/ARROW-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hongze Zhang updated ARROW-17159: - Description: This adds property start_offset_ and length_ to FileSource and should be functional for Parquet dataset format. Supporting Java and C++ dataset API at this time. (was: With that, we can use substrait plan ReadRel_LocalFiles_FileOrFiles.start() and length() to pushdown scan filter) > [C++][JAVA] Dataset: Support reading from fixed offset of a file for Parquet > format > --- > > Key: ARROW-17159 > URL: https://issues.apache.org/jira/browse/ARROW-17159 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Java >Affects Versions: 9.0.0 >Reporter: Jin Chengcheng >Assignee: Jin Chengcheng >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > This adds property start_offset_ and length_ to FileSource and should be > functional for Parquet dataset format. Supporting Java and C++ dataset API at > this time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17066) [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary
[ https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-17066. Resolution: Fixed Issue resolved by pull request 13605 [https://github.com/apache/arrow/pull/13605] > [C++][Substrait] "ignore_unknown_fields" should be specified when converting > JSON to binary > --- > > Key: ARROW-17066 > URL: https://issues.apache.org/jira/browse/ARROW-17066 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Richard Tia >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available, substrait > Fix For: 9.0.0 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions] > > When converting a substrait JSON to binary, there are many unknown fields > that may exist since substrait is being built every week. > ignore_unknown_fields should be specified when doing this conversion. > > This is resulting in frequent errors similar to this: > {code:java} > E pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned > INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure) > arguments: Cannot find field. > pyarrow/error.pxi:100: ArrowInvalid {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-16304) [C++] arrow-dataset-file-parquet-test sporadic failure in appveyor job
[ https://issues.apache.org/jira/browse/ARROW-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai closed ARROW-16304. Resolution: Cannot Reproduce > [C++] arrow-dataset-file-parquet-test sporadic failure in appveyor job > -- > > Key: ARROW-16304 > URL: https://issues.apache.org/jira/browse/ARROW-16304 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Yibo Cai >Assignee: Weston Pace >Priority: Major > > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/43330908/job/yw3djjni6as253m4 > {code:bash} > [ RUN ] > TestScan/TestParquetFileFormatScan.ScanRecordBatchReaderProjected/0Threaded16b1024r > C:/projects/arrow/cpp/src/arrow/util/future.cc:323: Check failed: > !IsFutureFinished(state_) Future already marked finished > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability
Vibhatha Lakmal Abeykoon created ARROW-17183: Summary: [C++] Adding ExecNode with Sort and Fetch capability Key: ARROW-17183 URL: https://issues.apache.org/jira/browse/ARROW-17183 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Vibhatha Lakmal Abeykoon Assignee: Vibhatha Lakmal Abeykoon In Substrait integrations with ACERO, a required piece of functionality is the ability to fetch records, sorted or unsorted. A Fetch operation is defined as selecting `K` records with an offset; for instance, picking 10 records while skipping the first 5. Here we can define this as a Slice operation, and records can be easily extracted in a sink node. A Sort and Fetch operation applies when we need to execute a Fetch operation on sorted data. The main issue is that we cannot have a sort node followed by a fetch node: all existing node definitions supporting sort are based on sink nodes, and since a sink node cannot be followed by another node, this functionality has to take place in a single node. This is not a perfect solution for fetch and sort, but one way to do it is to define a sink node where the records are sorted and then a set of items is fetched. Another dilemma is what if sort is followed by a fetch; in that case, there has to be a flag controlling the order of the operations. The objective of this ticket is to discuss a viable, efficient solution and include new nodes or a method to execute such logic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16692) [C++] StackOverflow in merge generator causes segmentation fault in scan
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569929#comment-17569929 ] Raúl Cumplido commented on ARROW-16692: --- This is a blocker for 9.0.0. Is there something we could do to unblock it? I am just asking because there has not been much activity on this ticket and I am not sure how this is going to affect the release schedule at the moment. > [C++] StackOverflow in merge generator causes segmentation fault in scan > > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Blocker > Fix For: 9.0.0 > > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-13238) [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans
[ https://issues.apache.org/jira/browse/ARROW-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-13238: Assignee: Vibhatha Lakmal Abeykoon (was: Ben Kietzman) > [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans > -- > > Key: ARROW-13238 > URL: https://issues.apache.org/jira/browse/ARROW-13238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 7.5h > Remaining Estimate: 0h > > ARROW-11930 grew too large to include substitution of an ExecPlan for > existing scan machinery, but this still needs to happen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-13238) [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans
[ https://issues.apache.org/jira/browse/ARROW-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-13238: Assignee: Ben Kietzman (was: Vibhatha Lakmal Abeykoon) > [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans > -- > > Key: ARROW-13238 > URL: https://issues.apache.org/jira/browse/ARROW-13238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 7.5h > Remaining Estimate: 0h > > ARROW-11930 grew too large to include substitution of an ExecPlan for > existing scan machinery, but this still needs to happen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569933#comment-17569933 ] Raúl Cumplido commented on ARROW-15678: --- Is this still a blocker? > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability
[ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569935#comment-17569935 ] Vibhatha Lakmal Abeykoon commented on ARROW-17183: -- Specifically about this case, we assume that these Fetch and Sort operations are the outermost relations in a Substrait plan, meaning the Sort or Fetch operation is called at the end of the other operations. This is not a very accurate representation, so first we need to understand if this is the general case. cc [~westonpace] [~jvanstraten] There can be a few settings where these two operations are applied. *Sort Only* If the query only has a Sorting operation, instead of adding the `SinkNodeConsumer` we need to add an `OrderBySinkNodeConsumer`. *Fetch Only* If the query only has a Fetch operation, we can include a node with fetch capability. At the moment we don't have a node with Fetch capability, so this may need to be added, where we could use logic similar to that used in the `SelectK` node. *SortAndFetch or FetchAndSort* If the query contains both sort and fetch in a given order, there has to be a single Sink node which can perform both operations in the given order. When scanning the plan components, when we find a sort we just add an `OrderBySink` and keep adding other relations. If we then find a Fetch operation, this needs to be replaced with a SortAndFetch operation where sorting is done first and fetching is done next; and this can go vice versa. *Another Approach:* Another approach is to define a sink node which can execute a function that does the expected operation. Some of the defined Sink nodes (KSelect, Sort) have a function called `DoFinish`. We should be able to call a custom function within this call. So on the Substrait end, when we extract the plan, we can write the required `std::function`, which would be an option for this custom sink node; we assume a table as input and write the logic accordingly. 
This way we don't have to introduce new nodes. But if there are other capabilities users need that ACERO doesn't cover, can we always keep adding nodes to fulfill them? I am not so sure. This is just a high-level thought process. I have implemented a _SortAndFetch_ node which can perform a fetch on sorted output, simply by following what is done in the Sort and SelectK nodes, but I am not exactly sure that any of these approaches is optimal or the best way to solve the problem. This is the part I am unclear about: what the most efficient way to include this capability would be, or whether there is a better way to do this. Appreciate your thoughts. cc [~westonpace] [~jvanstraten] [~bkietz] [~icook] > [C++] Adding ExecNode with Sort and Fetch capability > > > Key: ARROW-17183 > URL: https://issues.apache.org/jira/browse/ARROW-17183 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > In Substrait integrations with ACERO, a functionality required is the ability > to fetch records sorted and unsorted. > Fetch operation is defined as selecting `K` number of records with an offset. > For instance pick 10 records skipping the first 5 elements. Here we can > define this as a Slice operation and records can be easily extracted in a > sink-node. > Sort and Fetch operation applies when we need to execute a Fetch operation on > sorted data. The main issue is we cannot have a sort node followed by a > fetch. The reason is that all existing node definitions supporting sort are > based on sink nodes. Since there cannot be a node followed by sink, this > functionality has to take place in a single node. > But this is not a perfect solution for fetch and sort, but one way to do this > is define a sink node where the records are sorted and then a set of items > are fetched. > Another dilemma is what if sort is followed by a fetch. 
In that case, there > has to be a flag to enable the order of the operations. > The objective of this ticket is to discuss a viable efficient solution and > include new nodes or a method to execute such a logic. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None
[ https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alenka Frim reassigned ARROW-17142: --- Assignee: Kshiteej K > [Python] Parquet FileMetadata.equals() method segfaults when passed None > > > Key: ARROW-17142 > URL: https://issues.apache.org/jira/browse/ARROW-17142 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Kshiteej K >Assignee: Kshiteej K >Priority: Major > Labels: good-first-issue, pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({"a": [1, 2, 3]}) > # Here metadata is None > metadata = table.schema.metadata > fname = "data.parquet" > pq.write_table(table, fname) > # Get `metadata`. > r_metadata = pq.read_metadata(fname) > # Equals on Metadata segfaults when passed None > r_metadata.equals(metadata) {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out
Dragoș Moldovan-Grünfeld created ARROW-17184: Summary: [R] Investigate Nodes missing from ExecPlan print out Key: ARROW-17184 URL: https://issues.apache.org/jira/browse/ARROW-17184 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Dragoș Moldovan-Grünfeld Now that we can print an ExecPlan (ARROW-15016), we can investigate unexpected outputs. With the following chunk of code: {code:r} mtcars %>% arrow_table() %>% select(mpg, wt, cyl) %>% filter(mpg > 20) %>% arrange(desc(wt)) %>% head(3) %>% show_exec_plan() #> ExecPlan with 3 nodes: #> 2:SinkNode{} #> 1:ProjectNode{projection=[mpg, wt, cyl]} #> 0:SourceNode{} {code} * FilterNode disappears when head/tail are involved + * we do not have additional information regarding the OrderBySinkNode * the entry point is a SourceNode and not a TableSourceNode -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out
[ https://issues.apache.org/jira/browse/ARROW-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569947#comment-17569947 ] Dragoș Moldovan-Grünfeld commented on ARROW-17184: -- A relevant comment - https://github.com/apache/arrow/pull/13541#issuecomment-1192085798 > [R] Investigate Nodes missing from ExecPlan print out > - > > Key: ARROW-17184 > URL: https://issues.apache.org/jira/browse/ARROW-17184 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > Now that we can print an ExecPlan (ARROW-15016), we can investigate > unexpected outputs. With the following chunk of code: > {code:r} > mtcars %>% > arrow_table() %>% > select(mpg, wt, cyl) %>% > filter(mpg > 20) %>% > arrange(desc(wt)) %>% > head(3) %>% > show_exec_plan() > #> ExecPlan with 3 nodes: > #> 2:SinkNode{} > #> 1:ProjectNode{projection=[mpg, wt, cyl]} > #> 0:SourceNode{} > {code} > * FilterNode disappears when head/tail are involved + > * we do not have additional information regarding the OrderBySinkNode > * the entry point is a SourceNode and not a TableSourceNode -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17166) [R] [CI] Remove ENV TZ from docker files
[ https://issues.apache.org/jira/browse/ARROW-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569958#comment-17569958 ] Dragoș Moldovan-Grünfeld commented on ARROW-17166: -- It turns out I misinterpreted the CI output and this was an issue with OOM in the large memory tests. See [comment|https://github.com/apache/arrow/pull/13680#issuecomment-1191873953]. > [R] [CI] Remove ENV TZ from docker files > > > Key: ARROW-17166 > URL: https://issues.apache.org/jira/browse/ARROW-17166 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Rok Mihevc >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: CI, pull-request-available > Fix For: 9.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > We have noticed R CI job (AMD64 Ubuntu 20.04 R 4.2 Force-Tests true) failing > on master: > [1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547], > > [2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804], > > [3|https://github.com/apache/arrow/runs/7445803518?check_suite_focus=true#step:7:16305] > with: > {code:java} > Start test: array uses local timezone for POSIXct without timezone > test-Array.R:269:3 [success] > System has not been booted with systemd as init system (PID 1). Can't operate. > Failed to create bus connection: Host is down > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17166) [R] [CI] Exclude large memory tests from the force-tests job on CI
[ https://issues.apache.org/jira/browse/ARROW-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17166: - Summary: [R] [CI] Exclude large memory tests from the force-tests job on CI (was: [R] [CI] Remove ENV TZ from docker files) > [R] [CI] Exclude large memory tests from the force-tests job on CI > -- > > Key: ARROW-17166 > URL: https://issues.apache.org/jira/browse/ARROW-17166 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Rok Mihevc >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: CI, pull-request-available > Fix For: 9.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > We have noticed R CI job (AMD64 Ubuntu 20.04 R 4.2 Force-Tests true) failing > on master: > [1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547], > > [2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804], > > [3|https://github.com/apache/arrow/runs/7445803518?check_suite_focus=true#step:7:16305] > with: > {code:java} > Start test: array uses local timezone for POSIXct without timezone > test-Array.R:269:3 [success] > System has not been booted with systemd as init system (PID 1). Can't operate. > Failed to create bus connection: Host is down > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17171) [C++][Gandiva] Implement NextDay Function to case-insensitive
[ https://issues.apache.org/jira/browse/ARROW-17171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinicius Souza Roque updated ARROW-17171: - Summary: [C++][Gandiva] Implement NextDay Function to case-insensitive (was: [C++][Gandiva] Implement case-insensitive) > [C++][Gandiva] Implement NextDay Function to case-insensitive > - > > Key: ARROW-17171 > URL: https://issues.apache.org/jira/browse/ARROW-17171 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Vinicius Souza Roque >Priority: Trivial > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Implementing changes for the function to be case-insensitive -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569968#comment-17569968 ] Antoine Pitrou commented on ARROW-15678: Ideally it would... But there's little chance for it to be fixed in time for 9.0.0. As I said above, the workaround should be to disable runtime SIMD optimizations on the affected builds. Somebody has to validate that suggestion, though (i.e. someone who's able to reproduce this issue). > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-16444) [R] Implement user-defined scalar functions in R bindings
[ https://issues.apache.org/jira/browse/ARROW-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington resolved ARROW-16444. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13397 [https://github.com/apache/arrow/pull/13397] > [R] Implement user-defined scalar functions in R bindings > - > > Key: ARROW-16444 > URL: https://issues.apache.org/jira/browse/ARROW-16444 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dewey Dunnington >Assignee: Dewey Dunnington >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 26h 10m > Remaining Estimate: 0h > > In ARROW-15639, user-defined (scalar) functions were implemented for Python. > In ARROW-15841 and ARROW-15168 we developed some tooling and strategies for > calling into R from non-R threads, so in theory we should be able to mirror > the Python implementation (possibly with the constraint that we have to > provide a way to return a {{Table}} instead of a {{RecordBatchReader}} if > there are any user-defined functions, otherwise we can't guarantee the > existence of an event loop to do the R evaluating?). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569991#comment-17569991 ] Jacob Wujciak-Jens commented on ARROW-15678: Looking at [ARROW-15664] and this [PR|https://github.com/apache/arrow/pull/12364/files#diff-ca50d864d033146f9135f2fc25ae337322982dd340c6fa25b1efe9f0c02db870], it seems like a workaround has been implemented for Homebrew, IIUC. So this is still an issue, but since the real fix won't happen for 9.0.0, it shouldn't be a blocker anymore? > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569996#comment-17569996 ] Antoine Pitrou commented on ARROW-15678: If that was actually accepted by Homebrew then fine. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out for nested queries
[ https://issues.apache.org/jira/browse/ARROW-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17184: - Summary: [R] Investigate Nodes missing from ExecPlan print out for nested queries (was: [R] Investigate Nodes missing from ExecPlan print out) > [R] Investigate Nodes missing from ExecPlan print out for nested queries > > > Key: ARROW-17184 > URL: https://issues.apache.org/jira/browse/ARROW-17184 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > Now that we can print an ExecPlan (ARROW-15016), we can investigate > unexpected outputs. With the following chunk of code: > {code:r} > mtcars %>% > arrow_table() %>% > select(mpg, wt, cyl) %>% > filter(mpg > 20) %>% > arrange(desc(wt)) %>% > head(3) %>% > show_exec_plan() > #> ExecPlan with 3 nodes: > #> 2:SinkNode{} > #> 1:ProjectNode{projection=[mpg, wt, cyl]} > #> 0:SourceNode{} > {code} > * FilterNode disappears when head/tail are involved + > * we do not have additional information regarding the OrderBySinkNode > * the entry point is a SourceNode and not a TableSourceNode -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out for nested queries
[ https://issues.apache.org/jira/browse/ARROW-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17184: - Description: *Update:* We currently do not print a plan for nested queries as that would involve the execution of the inner most query / queries. That is likely the reason why, in the chunk below, we do not see the filter and arrange nodes. == *Original description:* Now that we can print an ExecPlan (ARROW-15016), we can investigate unexpected outputs. With the following chunk of code: {code:r} mtcars %>% arrow_table() %>% select(mpg, wt, cyl) %>% filter(mpg > 20) %>% arrange(desc(wt)) %>% head(3) %>% show_exec_plan() #> ExecPlan with 3 nodes: #> 2:SinkNode{} #> 1:ProjectNode{projection=[mpg, wt, cyl]} #> 0:SourceNode{} {code} * FilterNode disappears when head/tail are involved + * we do not have additional information regarding the OrderBySinkNode * the entry point is a SourceNode and not a TableSourceNode was: Now that we can print an ExecPlan (ARROW-15016), we can investigate unexpected outputs. With the following chunk of code: {code:r} mtcars %>% arrow_table() %>% select(mpg, wt, cyl) %>% filter(mpg > 20) %>% arrange(desc(wt)) %>% head(3) %>% show_exec_plan() #> ExecPlan with 3 nodes: #> 2:SinkNode{} #> 1:ProjectNode{projection=[mpg, wt, cyl]} #> 0:SourceNode{} {code} * FilterNode disappears when head/tail are involved + * we do not have additional information regarding the OrderBySinkNode * the entry point is a SourceNode and not a TableSourceNode > [R] Investigate Nodes missing from ExecPlan print out for nested queries > > > Key: ARROW-17184 > URL: https://issues.apache.org/jira/browse/ARROW-17184 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > *Update:* We currently do not print a plan for nested queries as that would > involve the execution of the inner most query / queries. 
That is likely the > reason why, in the chunk below, we do not see the filter and arrange nodes. > == > *Original description:* > Now that we can print an ExecPlan (ARROW-15016), we can investigate > unexpected outputs. With the following chunk of code: > {code:r} > mtcars %>% > arrow_table() %>% > select(mpg, wt, cyl) %>% > filter(mpg > 20) %>% > arrange(desc(wt)) %>% > head(3) %>% > show_exec_plan() > #> ExecPlan with 3 nodes: > #> 2:SinkNode{} > #> 1:ProjectNode{projection=[mpg, wt, cyl]} > #> 0:SourceNode{} > {code} > * FilterNode disappears when head/tail are involved + > * we do not have additional information regarding the OrderBySinkNode > * the entry point is a SourceNode and not a TableSourceNode -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16202) [C++][Parquet] WipeOutDecryptionKeys doesn't securely wipe out keys
[ https://issues.apache.org/jira/browse/ARROW-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16202: -- Fix Version/s: 10.0.0 (was: 9.0.0) > [C++][Parquet] WipeOutDecryptionKeys doesn't securely wipe out keys > --- > > Key: ARROW-16202 > URL: https://issues.apache.org/jira/browse/ARROW-16202 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Parquet >Affects Versions: 7.0.0 >Reporter: Antoine Pitrou >Priority: Critical > Fix For: 10.0.0 > > > {{InternalFileDecryptor::WipeOutDecryptionKeys()}} merely call > {{std::string::clear}} to dispose of the decryption key contents, but that > method is not guaranteed to clear memory (it probably doesn't, actually). > We should probably devise a portable wrapper function for the various > OS-specific memory clearing utilities. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16795) [C#][Flight] Nightly verify-rc-source-csharp-macos-arm64 fails
[ https://issues.apache.org/jira/browse/ARROW-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16795: -- Fix Version/s: (was: 9.0.0) > [C#][Flight] Nightly verify-rc-source-csharp-macos-arm64 fails > -- > > Key: ARROW-16795 > URL: https://issues.apache.org/jira/browse/ARROW-16795 > Project: Apache Arrow > Issue Type: Bug > Components: C#, FlightRPC >Affects Versions: 9.0.0 >Reporter: Raúl Cumplido >Priority: Critical > > The "verify-rc-source-csharp-macos-arm64" job has been failing on and off > since ~may 18th, the issue seems to be with the flight tests. > {code:java} > Failed Apache.Arrow.Flight.Tests.FlightTests.TestGetFlightMetadata [567 ms] > Error Message: > Grpc.Core.RpcException : Status(StatusCode="Internal", Detail="Error > starting gRPC call. HttpRequestException: An HTTP/2 connection could not be > established because the server did not complete the HTTP/2 handshake.", > DebugException="System.Net.Http.HttpRequestException: An HTTP/2 connection > could not be established because the server did not complete the HTTP/2 > handshake. 
> at > System.Net.Http.HttpConnectionPool.ReturnHttp2Connection(Http2Connection > connection, Boolean isNewConnection) > at > System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage > request) > at > System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object > s) > at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread > threadPoolThread, ExecutionContext executionContext, ContextCallback > callback, Object state) > at > System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread > threadPoolThread) > at > System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecuteFromThreadPool(Thread > threadPoolThread) > at System.Threading.ThreadPoolWorkQueue.Dispatch() > at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() > at System.Threading.Thread.StartCallback() > --- End of stack trace from previous location --- > at > System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken > cancellationToken) > at > System.Net.Http.HttpConnectionPool.GetHttp2ConnectionAsync(HttpRequestMessage > request, Boolean async, CancellationToken cancellationToken) > at > System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage > request, Boolean async, Boolean doRequestAuth, CancellationToken > cancellationToken) > at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, > Boolean async, CancellationToken cancellationToken) > at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, > Nullable`1 timeout)") > Stack Trace: > at Grpc.Net.Client.Internal.GrpcCall`2.GetResponseHeadersCoreAsync() > at > Apache.Arrow.Flight.Client.FlightClient.<>c.d.MoveNext() in > /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/csharp/src/Apache.Arrow.Flight/Client/FlightClient.cs:line > 71 > --- End of 
stack trace from previous location --- > at Apache.Arrow.Flight.Tests.FlightTests.TestGetFlightMetadata() in > /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/csharp/test/Apache.Arrow.Flight.Tests/FlightTests.cs:line > 183 > --- End of stack trace from previous location --- > Failed Apache.Arrow.Flight.Tests.FlightTests.TestGetSchema [108 ms] > Error Message: > Grpc.Core.RpcException : Status(StatusCode="Internal", Detail="Error > starting gRPC call. HttpRequestException: An HTTP/2 connection could not be > established because the server did not complete the HTTP/2 handshake.", > DebugException="System.Net.Http.HttpRequestException: An HTTP/2 connection > could not be established because the server did not complete the HTTP/2 > handshake. > at > System.Net.Http.HttpConnectionPool.ReturnHttp2Connection(Http2Connection > connection, Boolean isNewConnection) > at > System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage > request) > at > System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& > stateMachine) > at > System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage > request) > at > System.Net.Http.HttpConnectionPool.<>c__DisplayClass78_0.b__0() > at System.Threading.Tasks.Task`1.InnerInvoke() > at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj) > at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread > threadPoolThread, ExecutionContext executionContext, ContextCallback > callback,
[jira] [Updated] (ARROW-16919) [C++] Flight integration tests fail on verify rc nightly on linux amd64
[ https://issues.apache.org/jira/browse/ARROW-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16919: -- Fix Version/s: (was: 9.0.0) > [C++] Flight integration tests fail on verify rc nightly on linux amd64 > --- > > Key: ARROW-16919 > URL: https://issues.apache.org/jira/browse/ARROW-16919 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, FlightRPC >Reporter: Raúl Cumplido >Priority: Critical > Labels: Nightly, pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > Some of our nightly builds to verify the release are failing: > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-almalinux-8-amd64|https://github.com/ursacomputing/crossbow/runs/7073206980?check_suite_focus=true] > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-ubuntu-18.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073217433?check_suite_focus=true] > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-ubuntu-20.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073210299?check_suite_focus=true] > {color:#1d1c1d}- > {color}[verify-rc-source-integration-linux-ubuntu-22.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073273051?check_suite_focus=true] > with the following: > {code:java} > # FAILURES # > FAILED TEST: middleware C++ producing, C++ consuming > 1 failures > File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd > output = subprocess.check_output(cmd, stderr=subprocess.STDOUT) > File "/usr/lib/python3.8/subprocess.py", line 411, in check_output > return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, > File "/usr/lib/python3.8/subprocess.py", line 512, in run > raise CalledProcessError(retcode, process.args, > subprocess.CalledProcessError: Command > '['/tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client', > '-host', 'localhost', '-port=36719', '-scenario', 'middleware']' died 
with > . > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/arrow/dev/archery/archery/integration/runner.py", line 379, in > _run_flight_test_case > consumer.flight_request(port, **client_args) > File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 134, in > flight_request > run_cmd(cmd) > File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd > raise RuntimeError(sio.getvalue()) > RuntimeError: Command failed: > /tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client -host > localhost -port=36719 -scenario middleware > With output: > -- > Headers received successfully on failing call. > Headers received successfully on passing call. > free(): double free detected in tcache 2 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16727) [C++] Bump version of bundled AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16727: -- Fix Version/s: 10.0.0 > [C++] Bump version of bundled AWS SDK > - > > Key: ARROW-16727 > URL: https://issues.apache.org/jira/browse/ARROW-16727 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 10.0.0 > > > The latest version on the 1.8 line is 1.8.186. > We could also try to switch to the 1.9 line, but there were blocking issues > last time we tried. > Note we should bump the dependent AWS library versions at the same time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16724) [C++] Bump versions of bundled dependencies
[ https://issues.apache.org/jira/browse/ARROW-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16724: -- Fix Version/s: 10.0.0 (was: 9.0.0) > [C++] Bump versions of bundled dependencies > --- > > Key: ARROW-16724 > URL: https://issues.apache.org/jira/browse/ARROW-16724 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Antoine Pitrou >Priority: Critical > Fix For: 10.0.0 > > > We should bump bundled dependencies to their latest respective versions > before 9.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata
[ https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-16339: -- Fix Version/s: 10.0.0 (was: 9.0.0) > [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to > Arrow Schema metadata > - > > Key: ARROW-16339 > URL: https://issues.apache.org/jira/browse/ARROW-16339 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet, Python >Reporter: Joris Van den Bossche >Priority: Critical > Fix For: 10.0.0 > > > Context: I ran into this issue when reading Parquet files created by GDAL > (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which > writes files that have custom key_value_metadata, but without storing > ARROW:schema in those metadata (cc [~paleolimbot] > — > Both in reading and writing files, I expected that we would map Arrow > {{Schema::metadata}} with Parquet {{{}FileMetaData::key_value_metadata{}}}. > But apparently this doesn't (always) happen out of the box, and only happens > through the "ARROW:schema" field (which stores the original Arrow schema, and > thus the metadata stored in this schema). 
> For example, when writing a Table with schema metadata, this is not stored > directly in the Parquet FileMetaData (code below is using branch from > ARROW-16337 to have the {{store_schema}} keyword): > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"}) > pq.write_table(table, "test_metadata_with_arrow_schema.parquet") > pq.write_table(table, "test_metadata_without_arrow_schema.parquet", > store_schema=False) > # original schema has metadata > >>> table.schema > a: int64 > -- schema metadata -- > key: 'value' > # reading back only has the metadata in case we stored ARROW:schema > >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema > a: int64 > -- schema metadata -- > key: 'value' > # and not if ARROW:schema is absent > >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema > a: int64 > {code} > It seems that if we store the ARROW:schema, we _also_ store the schema > metadata separately. But if {{store_schema}} is False, we also stop writing > those metadata (not fully sure if this is the intended behaviour, and that's > the reason for the above output): > {code:python} > # when storing the ARROW:schema, we ALSO store key:value metadata > >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata > {b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...', > b'key': b'value'} > # when not storing the schema, we also don't store the key:value > >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata > >>> is None > True > {code} > On the reading side, it seems that we generally do read custom key/value > metadata into schema metadata. 
We don't have the pyarrow APIs at the moment > to create such a file (given the above), but with a small patch I could > create such a file: > {code:python} > # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key > >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata > {b'key': b'value'} > # this metadata is now correctly mapped to the Arrow schema metadata > >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet") > a: int64 > -- schema metadata -- > key: 'value' > {code} > But if you have a file that has both custom key/value metadata and an > "ARROW:schema" key, we actually ignore the custom keys, and only look at the > "ARROW:schema" one. > This was the case that I ran into with GDAL, where I have a file with both > keys, but where the custom "geo" key is not also included in the serialized > arrow schema in the "ARROW:schema" key: > {code:python} > # includes both keys in the Parquet file > >>> pq.read_metadata("test_gdal.parquet").metadata > {b'geo': b'{"version":"0.1.0","...', > b'ARROW:schema': b'/3gBAAAQ...'} > # the "geo" key is lost in the Arrow schema > >>> pq.read_table("test_gdal.parquet").schema.metadata is None > True > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
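The reading-side behaviour described in ARROW-16339 can be modelled as a plain dictionary merge. The sketch below is a toy model of the reported semantics, not pyarrow's internals: the function names are hypothetical, and the serialized `ARROW:schema` entry is represented here by the metadata dict it embeds.

```python
def current_reader(file_kv):
    """Toy model of today's behaviour: ARROW:schema wins, siblings are lost."""
    if "ARROW:schema" in file_kv:
        # Only metadata embedded in the serialized schema survives, so a
        # standalone key like 'geo' is dropped -- the reported problem.
        return dict(file_kv["ARROW:schema"])
    # Without ARROW:schema, custom keys map directly to schema metadata.
    return dict(file_kv)

def proposed_reader(file_kv):
    """Merge both sources so standalone key/value pairs are not lost."""
    merged = {k: v for k, v in file_kv.items() if k != "ARROW:schema"}
    merged.update(file_kv.get("ARROW:schema", {}))
    return merged

# A file like the GDAL one: a custom 'geo' key plus an ARROW:schema entry
# whose embedded metadata does not include 'geo'.
gdal_file = {"geo": '{"version":"0.1.0"}', "ARROW:schema": {}}
assert "geo" not in current_reader(gdal_file)   # the reported data loss
assert "geo" in proposed_reader(gdal_file)      # the desired outcome
```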
[jira] [Updated] (ARROW-17164) [C++] Expose higher-level utility to execute a kernel
[ https://issues.apache.org/jira/browse/ARROW-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-17164: -- Fix Version/s: 10.0.0 (was: 9.0.0) > [C++] Expose higher-level utility to execute a kernel > - > > Key: ARROW-17164 > URL: https://issues.apache.org/jira/browse/ARROW-17164 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 10.0.0 > > > Currently, the compute layer exposes several high-level facilities to execute > a compute function: {{CallFunction}} and {{Function::Execute}}. > However, if you'd favor a two-step approach of first resolving the {{Kernel}} > for a given set of argument types, then execute the kernel, then you're > forced to deal with the rather cumbersome {{Kernel}} execution interface. > It would be nice if the base {{Kernel}} class had something similar to the > {{Function::Execute}} method. -- This message was sent by Atlassian Jira (v8.20.10#820010)
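The two call paths ARROW-17164 contrasts can be sketched abstractly: a one-shot `CallFunction`-style entry point versus resolving a kernel once and executing it many times. The names below are hypothetical stand-ins, not Arrow's C++ API.

```python
# A toy kernel registry keyed by function name and argument types.
KERNELS = {
    ("add", (int, int)): lambda a, b: a + b,
    ("add", (float, float)): lambda a, b: a + b,
}

def resolve_kernel(name, args):
    """Step 1: pick a kernel from the concrete argument types."""
    return KERNELS[(name, tuple(type(a) for a in args))]

def execute_kernel(kernel, args):
    """Step 2: the convenience wrapper the ticket asks the Kernel class to grow."""
    return kernel(*args)

def call_function(name, args):
    """The existing high-level path, doing both steps in one call."""
    return execute_kernel(resolve_kernel(name, args), args)

assert call_function("add", [1, 2]) == 3
k = resolve_kernel("add", [1.0, 2.0])        # resolve once...
assert execute_kernel(k, [3.0, 4.0]) == 7.0  # ...then execute repeatedly
```

The point of the two-step shape is that dispatch cost is paid once per type signature rather than once per call.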
[jira] [Updated] (ARROW-17170) [C++][Docs] Research Documentation Formats
[ https://issues.apache.org/jira/browse/ARROW-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez updated ARROW-17170: --- Summary: [C++][Docs] Research Documentation Formats (was: Research Documentation Formats) > [C++][Docs] Research Documentation Formats > -- > > Key: ARROW-17170 > URL: https://issues.apache.org/jira/browse/ARROW-17170 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Documentation >Reporter: Kae Suarez >Assignee: Kae Suarez >Priority: Major > > In order to revise the documentation, some inspiration is needed to get the > format right. This ticket provides a space for exploration of possible > inspiration for the C++ documentation – once we have some good examples > and/or agreement, we can move to some content creation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None
[ https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li resolved ARROW-17142. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13658 [https://github.com/apache/arrow/pull/13658] > [Python] Parquet FileMetadata.equals() method segfaults when passed None > > > Key: ARROW-17142 > URL: https://issues.apache.org/jira/browse/ARROW-17142 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Kshiteej K >Assignee: Kshiteej K >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > > {code:java} > import pyarrow as pa import pyarrow.parquet as pq > table = pa.table({"a": [1, 2, 3]}) > # Here metadata is None > metadata = table.schema.metadata > fname = "data.parquet" > pq.write_table(table, fname) # Get `metadata`. > r_metadata = pq.read_metadata(fname) > # Equals on Metadata segfaults when passed None > r_metadata.equals(metadata) {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
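The segfault above is the usual consequence of a binding handing `None` straight to native code. A minimal pure-Python sketch of the guard an `equals()` method needs follows; the class is illustrative, not pyarrow's implementation, and whether `None` should compare unequal or raise is a design choice (here it compares unequal).

```python
class FileMetadata:
    """Illustrative stand-in for a wrapped C++ object (not pyarrow's class)."""

    def __init__(self, values):
        self._values = values

    def equals(self, other):
        # Guard before touching other's internals: a binding that goes
        # straight to C++ would dereference null here and crash.
        if other is None:
            return False
        if not isinstance(other, FileMetadata):
            raise TypeError(f"expected FileMetadata, got {type(other).__name__}")
        return self._values == other._values

m = FileMetadata({"key": "value"})
assert m.equals(None) is False
assert m.equals(FileMetadata({"key": "value"})) is True
```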
[jira] [Created] (ARROW-17185) [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPla$BuildAndShow()
Dragoș Moldovan-Grünfeld created ARROW-17185: Summary: [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPla$BuildAndShow() Key: ARROW-17185 URL: https://issues.apache.org/jira/browse/ARROW-17185 Project: Apache Arrow Issue Type: Improvement Components: C++, R Affects Versions: 8.0.0 Reporter: Dragoș Moldovan-Grünfeld -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17185) [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()
[ https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17185: - Summary: [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow() (was: [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPla$BuildAndShow()) > [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and > ExecPlan$BuildAndShow() > - > > Key: ARROW-17185 > URL: https://issues.apache.org/jira/browse/ARROW-17185 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17185) [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()
[ https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17185: - Description: When adding a print method for ExecPlans in R we chose to copy some of the code from {{ExecPlan_prepare}} to {{Exec_Plan_BuildAndShow}} and from {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. Relevant PR: https://github.com/apache/arrow/pull/13541 > [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and > ExecPlan$BuildAndShow() > - > > Key: ARROW-17185 > URL: https://issues.apache.org/jira/browse/ARROW-17185 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > When adding a print method for ExecPlans in R we chose to copy some of the > code from {{ExecPlan_prepare}} to {{Exec_Plan_BuildAndShow}} and from > {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. > Relevant PR: https://github.com/apache/arrow/pull/13541 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()
[ https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17185: - Component/s: C++ > [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and > ExecPlan$BuildAndShow() > --- > > Key: ARROW-17185 > URL: https://issues.apache.org/jira/browse/ARROW-17185 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > When adding a print method for ExecPlans in R we chose to copy some of the > code from {{ExecPlan_prepare}} to {{Exec_Plan_BuildAndShow}} and from > {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. > Relevant PR: https://github.com/apache/arrow/pull/13541 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()
[ https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17185: - Component/s: (was: C++) > [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and > ExecPlan$BuildAndShow() > --- > > Key: ARROW-17185 > URL: https://issues.apache.org/jira/browse/ARROW-17185 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > When adding a print method for ExecPlans in R we chose to copy some of the > code from {{ExecPlan_prepare}} to {{Exec_Plan_BuildAndShow}} and from > {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. > Relevant PR: https://github.com/apache/arrow/pull/13541 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()
[ https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld updated ARROW-17185: - Summary: [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow() (was: [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()) > [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and > ExecPlan$BuildAndShow() > --- > > Key: ARROW-17185 > URL: https://issues.apache.org/jira/browse/ARROW-17185 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > When adding a print method for ExecPlans in R we chose to copy some of the > code from {{ExecPlan_prepare}} to {{Exec_Plan_BuildAndShow}} and from > {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. > Relevant PR: https://github.com/apache/arrow/pull/13541 -- This message was sent by Atlassian Jira (v8.20.10#820010)
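The refactoring ARROW-17185 proposes is the standard extract-shared-helper pattern: both the run path and the show path call one preparation routine instead of each keeping a copy of it. A hypothetical Python sketch (not the arrow R/C++ code):

```python
class ExecPlan:
    """Toy sketch of deduplicating plan preparation (hypothetical names)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def _build(self):
        # The preparation steps both paths currently duplicate live in
        # exactly one place after the refactor.
        return [f"{i}:{n}" for i, n in enumerate(reversed(self.nodes))]

    def run(self):
        plan = self._build()
        return f"executed {len(plan)} nodes"

    def build_and_show(self):
        # Reuses _build() instead of carrying a copy of its body.
        lines = [f"ExecPlan with {len(self.nodes)} nodes:", *self._build()]
        return "\n".join(lines)

p = ExecPlan(["SourceNode", "ProjectNode", "SinkNode"])
assert p.run() == "executed 3 nodes"
assert p.build_and_show().startswith("ExecPlan with 3 nodes:")
```

A fix to `_build` then benefits both entry points at once, which is the motivation for removing the duplication.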
[jira] [Commented] (ARROW-17104) [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3
[ https://issues.apache.org/jira/browse/ARROW-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570088#comment-17570088 ] Raúl Cumplido commented on ARROW-17104: --- This is not critical anymore as the MAC OS build has been fixed with the upstream homebrew package update for protobuf. I am not closing it as we might want to pick some of the changes from the initial PR. > [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3 > - > > Key: ARROW-17104 > URL: https://issues.apache.org/jira/browse/ARROW-17104 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Raúl Cumplido >Assignee: Antoine Pitrou >Priority: Critical > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 7h > Remaining Estimate: 0h > > The *AMD64 MacOS 10.15 Python 3* job has started to fail on master with the > following error: > {code:java} > + pytest -r s -v --pyargs pyarrow > = test session starts > == > platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- > /usr/local/opt/python@3.9/bin/python3.9 > cachedir: .pytest_cache > hypothesis profile 'default' -> > database=DirectoryBasedExampleDatabase('/Users/runner/work/arrow/arrow/.hypothesis/examples') > rootdir: /Users/runner/work/arrow/arrow > plugins: hypothesis-6.52.1, lazy-fixture-0.6.3 > collecting ... collected 0 items / 1 error > ERRORS > > ERROR collecting test session > _ > /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127: > in import_module > return _bootstrap._gcd_import(name[level:], package, level) > :1030: in _gcd_import > ??? > :1007: in _find_and_load > ??? > :972: in _find_and_load_unlocked > ??? > :228: in _call_with_frames_removed > ??? > :1030: in _gcd_import > ??? > :1007: in _find_and_load > ??? > :986: in _find_and_load_unlocked > ??? > :680: in _load_unlocked > ??? > :850: in exec_module > ??? > :228: in _call_with_frames_removed > ??? 
> /usr/local/lib/python3.9/site-packages/pyarrow/__init__.py:65: in > import pyarrow.lib as _lib > E ImportError: > dlopen(/usr/local/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so, > 2): Symbol not found: __ZN6google8protobuf8internal16InternalMetadataD1Ev > E Referenced from: /usr/local/lib/libarrow.900.dylib > E Expected in: flat namespace > E in /usr/local/lib/libarrow.900.dylib > Interrupted: 1 error during collection > > === 1 error in 5.18s > === > Error: Process completed with exit code 2. {code} > See an example of build on arrow/master branch: > https://github.com/apache/arrow/runs/7385879183?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17104) [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3
[ https://issues.apache.org/jira/browse/ARROW-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-17104: -- Fix Version/s: 10.0.0 (was: 9.0.0) > [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3 > - > > Key: ARROW-17104 > URL: https://issues.apache.org/jira/browse/ARROW-17104 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Raúl Cumplido >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 7h > Remaining Estimate: 0h > > The *AMD64 MacOS 10.15 Python 3* job has started to fail on master with the > following error: > {code:java} > + pytest -r s -v --pyargs pyarrow > = test session starts > == > platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- > /usr/local/opt/python@3.9/bin/python3.9 > cachedir: .pytest_cache > hypothesis profile 'default' -> > database=DirectoryBasedExampleDatabase('/Users/runner/work/arrow/arrow/.hypothesis/examples') > rootdir: /Users/runner/work/arrow/arrow > plugins: hypothesis-6.52.1, lazy-fixture-0.6.3 > collecting ... collected 0 items / 1 error > ERRORS > > ERROR collecting test session > _ > /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127: > in import_module > return _bootstrap._gcd_import(name[level:], package, level) > :1030: in _gcd_import > ??? > :1007: in _find_and_load > ??? > :972: in _find_and_load_unlocked > ??? > :228: in _call_with_frames_removed > ??? > :1030: in _gcd_import > ??? > :1007: in _find_and_load > ??? > :986: in _find_and_load_unlocked > ??? > :680: in _load_unlocked > ??? > :850: in exec_module > ??? > :228: in _call_with_frames_removed > ??? 
> /usr/local/lib/python3.9/site-packages/pyarrow/__init__.py:65: in > import pyarrow.lib as _lib > E ImportError: > dlopen(/usr/local/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so, > 2): Symbol not found: __ZN6google8protobuf8internal16InternalMetadataD1Ev > E Referenced from: /usr/local/lib/libarrow.900.dylib > E Expected in: flat namespace > E in /usr/local/lib/libarrow.900.dylib > Interrupted: 1 error during collection > > === 1 error in 5.18s > === > Error: Process completed with exit code 2. {code} > See an example of build on arrow/master branch: > https://github.com/apache/arrow/runs/7385879183?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17104) [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3
[ https://issues.apache.org/jira/browse/ARROW-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raúl Cumplido updated ARROW-17104: -- Priority: Major (was: Critical) > [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3 > - > > Key: ARROW-17104 > URL: https://issues.apache.org/jira/browse/ARROW-17104 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Raúl Cumplido >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 7h > Remaining Estimate: 0h > > The *AMD64 MacOS 10.15 Python 3* job has started to fail on master with the > following error: > {code:java} > + pytest -r s -v --pyargs pyarrow > = test session starts > == > platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- > /usr/local/opt/python@3.9/bin/python3.9 > cachedir: .pytest_cache > hypothesis profile 'default' -> > database=DirectoryBasedExampleDatabase('/Users/runner/work/arrow/arrow/.hypothesis/examples') > rootdir: /Users/runner/work/arrow/arrow > plugins: hypothesis-6.52.1, lazy-fixture-0.6.3 > collecting ... collected 0 items / 1 error > ERRORS > > ERROR collecting test session > _ > /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127: > in import_module > return _bootstrap._gcd_import(name[level:], package, level) > :1030: in _gcd_import > ??? > :1007: in _find_and_load > ??? > :972: in _find_and_load_unlocked > ??? > :228: in _call_with_frames_removed > ??? > :1030: in _gcd_import > ??? > :1007: in _find_and_load > ??? > :986: in _find_and_load_unlocked > ??? > :680: in _load_unlocked > ??? > :850: in exec_module > ??? > :228: in _call_with_frames_removed > ??? 
> /usr/local/lib/python3.9/site-packages/pyarrow/__init__.py:65: in > import pyarrow.lib as _lib > E ImportError: > dlopen(/usr/local/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so, > 2): Symbol not found: __ZN6google8protobuf8internal16InternalMetadataD1Ev > E Referenced from: /usr/local/lib/libarrow.900.dylib > E Expected in: flat namespace > E in /usr/local/lib/libarrow.900.dylib > Interrupted: 1 error during collection > > === 1 error in 5.18s > === > Error: Process completed with exit code 2. {code} > See an example of build on arrow/master branch: > https://github.com/apache/arrow/runs/7385879183?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17110) [C++] Move away from C++11
[ https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570100#comment-17570100 ] Weston Pace commented on ARROW-17110: - Just for consideration, is the following policy possible? "We do not release new versions for R < 4 but we will consider backporting critical security issues" I'm not sure if that would be more or less work than sprinkling more ifdef/checks throughout our code base. > [C++] Move away from C++11 > -- > > Key: ARROW-17110 > URL: https://issues.apache.org/jira/browse/ARROW-17110 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: H. Vetinari >Priority: Major > > The upcoming abseil release has dropped support for C++11, so > {_}eventually{_}, arrow will have to follow. More details > [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. > Relatedly, when I > [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch > abseil to a newer C++ version on windows, things apparently broke in arrow > CI. This is because the ABI of abseil is sensitive to the C++ standard that's > used to compile, and google only supports a homogeneous version to compile > all artefacts in a stack. This creates some friction with conda-forge (where > the compilers are generally much newer than what arrow might be willing to > impose). For now, things seems to have worked out with arrow > [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] > C\+\+11 while conda-forge moved to C\+\+17 - at least on unix, but windows > was not so lucky. > Perhaps people would therefore also be interested in collaborating (or at > least commenting on) this > [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which > should permit more flexibility by being able to opt into given standard > versions also from conda-forge. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17186) [C++][Docs] Verify Completeness of API Reference
[ https://issues.apache.org/jira/browse/ARROW-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kae Suarez updated ARROW-17186: --- Component/s: C++ Documentation > [C++][Docs] Verify Completeness of API Reference > > > Key: ARROW-17186 > URL: https://issues.apache.org/jira/browse/ARROW-17186 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Documentation >Reporter: Kae Suarez >Priority: Major > > At least the Datum class is missing methods in the API documentation, and any > other incompleteness needs to be found and addressed. This ticket is for > hunting down incompleteness, and filling in the gaps. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17186) [C++][Docs] Verify Completeness of API Reference
Kae Suarez created ARROW-17186: -- Summary: [C++][Docs] Verify Completeness of API Reference Key: ARROW-17186 URL: https://issues.apache.org/jira/browse/ARROW-17186 Project: Apache Arrow Issue Type: Sub-task Reporter: Kae Suarez At least the Datum class is missing methods in the API documentation, and any other incompleteness needs to be found and addressed. This ticket is for hunting down incompleteness, and filling in the gaps. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns
[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-16700. - Resolution: Fixed Issue resolved by pull request 13518 [https://github.com/apache/arrow/pull/13518] > [C++] [R] [Datasets] aggregates on partitioning columns > --- > > Key: ARROW-16700 > URL: https://issues.apache.org/jira/browse/ARROW-16700 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Jonathan Keane >Assignee: Jeroen van Straten >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.2, 9.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > When summarizing a whole dataset (without group_by) with an aggregate, and > summarizing a partitioned column, arrow returns wrong data: > {code:r} > library(arrow, warn.conflicts = FALSE) > library(dplyr, warn.conflicts = FALSE) > df <- expand.grid( > some_nulls = c(0L, 1L, 2L), > year = 2010:2023, > month = 1:12, > day = 1:30 > ) > path <- tempfile() > dir.create(path) > write_dataset(df, path, partitioning = c("year", "month")) > ds <- open_dataset(path) > # with arrow the mins/maxes are off for partitioning columns > ds %>% > summarise(n = n(), min_year = min(year), min_month = min(month), min_day = > min(day), max_year = max(year), max_month = max(month), max_day = max(day)) > %>% > collect() > #> # A tibble: 1 × 7 > #> n min_year min_month min_day max_year max_month max_day > #> > #> 1 15120 2023 1 1 2023 12 30 > # compared to what we get with dplyr > df %>% > summarise(n = n(), min_year = min(year), min_month = min(month), min_day = > min(day), max_year = max(year), max_month = max(month), max_day = max(day)) > %>% > collect() > #> n min_year min_month min_day max_year max_month max_day > #> 1 15120 2010 1 1 2023 12 30 > # even min alone is off: > ds %>% > summarise(min_year = min(year)) %>% > collect() > #> # A tibble: 1 × 1 > #> min_year > #> > #> 1 2016 > > # but non-partitioning columns are fine: > ds %>% > 
summarise(min_day = min(day)) %>% > collect() > #> # A tibble: 1 × 1 > #> min_day > #> > #> 1 1 > > > # But with a group_by, this seems ok > ds %>% > group_by(some_nulls) %>% > summarise(min_year = min(year)) %>% > collect() > #> # A tibble: 3 × 2 > #> some_nulls min_year > #> > #> 1 0 2010 > #> 2 1 2010 > #> 3 2 2010 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
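As a cross-check of the reprex above, a small pure-Python analogue (illustrative only, no Arrow involved) builds the same cross product and computes the aggregates dplyr returns, i.e. the values an Arrow dataset scan should also produce for partitioning columns:

```python
# Hypothetical pure-Python analogue of the R reprex above: the same
# some_nulls x year x month x day cross product, aggregated directly.
from itertools import product

rows = [
    {"some_nulls": s, "year": y, "month": m, "day": d}
    for s, y, m, d in product(range(3), range(2010, 2024), range(1, 13), range(1, 31))
]

summary = {
    "n": len(rows),
    "min_year": min(r["year"] for r in rows),
    "max_year": max(r["year"] for r in rows),
    "min_month": min(r["month"] for r in rows),
    "max_month": max(r["month"] for r in rows),
    "min_day": min(r["day"] for r in rows),
    "max_day": max(r["day"] for r in rows),
}
print(summary)  # n = 15120, min_year = 2010 — matching the dplyr result above
```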
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570126#comment-17570126 ] Jacob Wujciak-Jens commented on ARROW-15678: That was my impression: [issue|https://github.com/Homebrew/homebrew-core/issues/94724] and [PR|https://github.com/Homebrew/homebrew-core/pull/94958] in homebrew-core. Maybe [~jonkeane] can confirm? > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17187) [R] Improve lazy ALTREP implementation for String
Dewey Dunnington created ARROW-17187: Summary: [R] Improve lazy ALTREP implementation for String Key: ARROW-17187 URL: https://issues.apache.org/jira/browse/ARROW-17187 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dewey Dunnington ARROW-16578 noted that there was a high cost to looping through an ALTREP character vector that we created in the arrow R package. The temporary workaround is to materialize whenever the first element is requested, which is much faster than our initial implementation but is probably not necessary given that other ALTREP character implementations appear to not have this issue: (Timings before merging ARROW-16578, which reduces the 5 second operation below to 0.05 seconds). {code:R} library(arrow, warn.conflicts = FALSE) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. df1 <- tibble::tibble(x=as.character(floor(runif(100) * 20))) write_parquet(df1,"/tmp/test.parquet") df2 <- read_parquet("/tmp/test.parquet") system.time(unique(df1$x)) #>user system elapsed #> 0.022 0.001 0.023 system.time(unique(df2$x)) #>user system elapsed #> 4.529 0.680 5.226 # the speed is almost certainly not due to ALTREP itself # but is probably something to do with our implementation tf <- tempfile() readr::write_csv(df1, tf) df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE) #> Rows: 100 Columns: 1 #> ── Column specification #> Delimiter: "," #> dbl (1): x #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. .Internal(inspect(df3$x)) #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=100, materialized=F) system.time(unique(df3$x)) #>user system elapsed #> 0.127 0.001 0.128 .Internal(inspect(df3$x)) #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=100, materialized=F) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
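The materialize-on-first-access workaround described above can be sketched in Python. This is a toy model, not the arrow R package's actual ALTREP code: the point is that one bulk conversion on the first element access avoids paying a per-element conversion cost inside loops like `unique()`.

```python
# Toy model of the ALTREP workaround: convert the whole vector the first
# time any element is requested, instead of converting element-by-element.
class LazyStringVector:
    def __init__(self, source):
        self._source = source      # e.g. an Arrow array-like object
        self._materialized = None  # filled in on first access

    def __getitem__(self, i):
        if self._materialized is None:
            # One bulk conversion instead of len(self) tiny ones.
            self._materialized = [str(x) for x in self._source]
        return self._materialized[i]

    def __len__(self):
        return len(self._source)

v = LazyStringVector([1, 2, 3])
assert v._materialized is None  # nothing converted yet
_ = v[0]                        # first access materializes everything
assert v._materialized == ["1", "2", "3"]
```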
[jira] [Resolved] (ARROW-16578) [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file
[ https://issues.apache.org/jira/browse/ARROW-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington resolved ARROW-16578. -- Resolution: Fixed Issue resolved by pull request 13415 [https://github.com/apache/arrow/pull/13415] > [R] unique() and is.na() on a column of a tibble is much slower after writing > to and reading from a parquet file > > > Key: ARROW-16578 > URL: https://issues.apache.org/jira/browse/ARROW-16578 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0, 8.0.0 >Reporter: Hideaki Hayashi >Assignee: Hideaki Hayashi >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > unique() on a column of a tibble is much slower after writing to and reading > from a parquet file. > Here is a reprex. > {{df1 <- tibble::tibble(x=as.character(floor(runif(100) * 20)))}} > {{write_parquet(df1,"/tmp/test.parquet")}} > {{df2 <- read_parquet("/tmp/test.parquet")}} > {{system.time(unique(df1$x))}} > {{# Result on my late 2020 macbook pro with M1 processor:}} > {{# user system elapsed }} > {{# 0.020 0.000 0.021 }} > {{system.time(unique(df2$x))}} > {{# user system elapsed }} > {{# 5.230 0.419 5.649 }} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()
[ https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington updated ARROW-17185: - Description: When adding a print method for ExecPlans in R we chose to copy some of the code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. Relevant PR: https://github.com/apache/arrow/pull/13541 (ARROW-15016) was: When adding a print method for ExecPlans in R we chose to copy some of the code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. Relevant PR: https://github.com/apache/arrow/pull/13541 > [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and > ExecPlan$BuildAndShow() > --- > > Key: ARROW-17185 > URL: https://issues.apache.org/jira/browse/ARROW-17185 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > > When adding a print method for ExecPlans in R we chose to copy some of the > code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from > {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. > Relevant PR: https://github.com/apache/arrow/pull/13541 (ARROW-15016) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-15016) [R] show_exec_plan() for an arrow_dplyr_query
[ https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington resolved ARROW-15016. -- Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13541 [https://github.com/apache/arrow/pull/13541] > [R] show_exec_plan() for an arrow_dplyr_query > - > > Key: ARROW-15016 > URL: https://issues.apache.org/jira/browse/ARROW-15016 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 18.5h > Remaining Estimate: 0h > > {*}Proposed approach{*}: [design > doc|https://docs.google.com/document/d/1Ep8aV4jDsNCCy9uv1bjWY_JF17nzHQogv0EnGJvraQI/edit#] > *Steps* > * Read about ExecPlan and ExecPlan::ToString > ** https://issues.apache.org/jira/browse/ARROW-14233 > ** https://issues.apache.org/jira/browse/ARROW-15138 > ** https://issues.apache.org/jira/browse/ARROW-13785 > * Hook up to the existing C++ ToString method for ExecPlans > * Implement a {{ToString()}} method for {{ExecPlan}} R6 class > * Implement and document {{show_exec_plan()}} > {*}Original description{*}: > Now that we can print a query plan (ARROW-13785) we should wire this up in R > so we can see what execution plans are being put together for various queries > (like the TPC-H queries) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17188) [R] Update news for 9.0.0
Will Jones created ARROW-17188: -- Summary: [R] Update news for 9.0.0 Key: ARROW-17188 URL: https://issues.apache.org/jira/browse/ARROW-17188 Project: Apache Arrow Issue Type: New Feature Components: R Affects Versions: 9.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17121) [Gandiva][C++] Adding mask function
[ https://issues.apache.org/jira/browse/ARROW-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Palak Pariawala resolved ARROW-17121. - Resolution: Fixed > [Gandiva][C++] Adding mask function > --- > > Key: ARROW-17121 > URL: https://issues.apache.org/jira/browse/ARROW-17121 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Palak Pariawala >Assignee: Palak Pariawala >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Add mask(str inp[, str uc-mask[, str lc-mask[, str num-mask]]]) function to > Gandiva. > With default masking upper case letters as 'X', lower case letters as 'x' and > numbers as 'n'. > Custom masking as optionally specified in parameters. -- This message was sent by Atlassian Jira (v8.20.10#820010)
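The masking behaviour described above can be sketched as follows. The defaults come from the issue text; how non-alphanumeric characters are treated (passed through unchanged here) is an assumption, not something confirmed by the Gandiva implementation:

```python
# Sketch of mask(str inp[, str uc-mask[, str lc-mask[, str num-mask]]]):
# upper case -> 'X', lower case -> 'x', digits -> 'n' by default, with
# optional custom masks. Non-alphanumerics pass through (assumption).
def mask(inp, uc_mask="X", lc_mask="x", num_mask="n"):
    out = []
    for ch in inp:
        if ch.isupper():
            out.append(uc_mask)
        elif ch.islower():
            out.append(lc_mask)
        elif ch.isdigit():
            out.append(num_mask)
        else:
            out.append(ch)
    return "".join(out)

print(mask("aBc-123"))                  # xXx-nnn
print(mask("aBc-123", "U", "l", "#"))   # lUl-###
```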
[jira] [Closed] (ARROW-17121) [Gandiva][C++] Adding mask function
[ https://issues.apache.org/jira/browse/ARROW-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Palak Pariawala closed ARROW-17121. --- > [Gandiva][C++] Adding mask function > --- > > Key: ARROW-17121 > URL: https://issues.apache.org/jira/browse/ARROW-17121 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Palak Pariawala >Assignee: Palak Pariawala >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Add mask(str inp[, str uc-mask[, str lc-mask[, str num-mask]]]) function to > Gandiva. > With default masking upper case letters as 'X', lower case letters as 'x' and > numbers as 'n'. > Custom masking as optionally specified in parameters. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17088) [R] Use `.arrow` as extension of IPC files of datasets
[ https://issues.apache.org/jira/browse/ARROW-17088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17088: --- Labels: pull-request-available (was: ) > [R] Use `.arrow` as extension of IPC files of datasets > -- > > Key: ARROW-17088 > URL: https://issues.apache.org/jira/browse/ARROW-17088 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: SHIMA Tatsuya >Assignee: Kazuyuki Ura >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Related to ARROW-17072 > As noted in the following document, the recommended extension for IPC files > is now `.arrow`. > > We recommend the “.arrow” extension for files created with this format. > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format > However, currently when writing a dataset with the {{write_dataset}} > function, the default extension is {{.feather}} when {{feather}} is selected > as the format, and {{.ipc}} when {{ipc}} is selected. > https://github.com/apache/arrow/blob/f295da4cfdcf102d9ac2d16bbca6f8342fc3e6a8/r/R/dataset-write.R#L124-L126 -- This message was sent by Atlassian Jira (v8.20.10#820010)
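The extension defaults discussed above can be summarized as a mapping from format name to file extension. This is an illustrative sketch of the proposed change, not `write_dataset`'s actual implementation; the `basename_template` helper name is hypothetical:

```python
# Current defaults per the issue text, vs. the ".arrow" extension the
# columnar-format docs now recommend for IPC files.
CURRENT_DEFAULT_EXTENSION = {"feather": ".feather", "ipc": ".ipc"}
RECOMMENDED_EXTENSION = {"feather": ".arrow", "ipc": ".arrow"}

def basename_template(fmt, use_recommended=True):
    # Hypothetical helper: build a dataset part-file name template.
    table = RECOMMENDED_EXTENSION if use_recommended else CURRENT_DEFAULT_EXTENSION
    return "part-{i}" + table[fmt]

print(basename_template("ipc"))                          # part-{i}.arrow
print(basename_template("ipc", use_recommended=False))   # part-{i}.ipc
```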
[jira] [Resolved] (ARROW-16395) [R] Implement lubridate's parsers with year, month, and day, hour, minute, and second components
[ https://issues.apache.org/jira/browse/ARROW-16395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dewey Dunnington resolved ARROW-16395. -- Resolution: Fixed Issue resolved by pull request 13627 [https://github.com/apache/arrow/pull/13627] > [R] Implement lubridate's parsers with year, month, and day, hour, minute, > and second components > > > Key: ARROW-16395 > URL: https://issues.apache.org/jira/browse/ARROW-16395 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Rok Mihevc >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() -- This message was sent by Atlassian Jira (v8.20.10#820010)
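Each lubridate-style parser name above encodes a component order (`ymd_hms` = year, month, day, hour, minute, second, and so on), which maps onto a set of candidate format strings to try in turn. A hedged sketch, with illustrative format strings rather than arrow's actual parsing kernels:

```python
# Toy order-based parser: try each candidate strptime format for the
# requested orders until one matches. Format lists are illustrative.
from datetime import datetime

ORDER_FORMATS = {
    "ymd_hms": ["%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S"],
    "dmy_hm":  ["%d-%m-%Y %H:%M", "%d/%m/%Y %H:%M"],
    "mdy_h":   ["%m-%d-%Y %H", "%m/%d/%Y %H"],
}

def parse_date_time(s, orders):
    for order in orders:
        for fmt in ORDER_FORMATS[order]:
            try:
                return datetime.strptime(s, fmt)
            except ValueError:
                pass  # this candidate didn't match; try the next one
    raise ValueError(f"no format matched {s!r}")

print(parse_date_time("2022-07-22 19:28:00", ["ymd_hms"]))
```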
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570163#comment-17570163 ] Jonathan Keane commented on ARROW-15678: Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't followed through yet, though. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570164#comment-17570164 ] Antoine Pitrou commented on ARROW-15678: Note that I suggested a perhaps more acceptable workaround above. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570163#comment-17570163 ] Jonathan Keane edited comment on ARROW-15678 at 7/22/22 7:28 PM: - Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't followed through yet, though. https://github.com/Homebrew/homebrew-core/issues/94724#issuecomment-1063031123 was (Author: jonkeane): Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't followed through yet, though. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17177) [C++][Docs] Re-Organize the Existing ACERO Streaming Engine Documentation
[ https://issues.apache.org/jira/browse/ARROW-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570179#comment-17570179 ] Aldrin Montana commented on ARROW-17177: Linking related JIRAs for improving Acero documentation and updating the overview of components and layers > [C++][Docs] Re-Organize the Existing ACERO Streaming Engine Documentation > - > > Key: ARROW-17177 > URL: https://issues.apache.org/jira/browse/ARROW-17177 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Documentation >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > The current document is too long. Creating sub-pages for each example, > explaining the code, and providing a better description would make the content > much more readable and easier to browse. The idea is to create a sub-folder in > the examples called `acero` and include each example in a separate `.cc` file. > This is the code change. Following this, the documentation page on the website > can be split into sub-pages. This is the only change suggested for this > sub-task. > There is already a JIRA: https://issues.apache.org/jira/browse/ARROW-16802 to > improve the internal content. So it would be used for re-writing the content. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
[ https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-17115. - Resolution: Fixed Issue resolved by pull request 13679 [https://github.com/apache/arrow/pull/13679] > [C++] HashJoin fails if it encounters a batch with more than 32Ki rows > -- > > Key: ARROW-17115 > URL: https://issues.apache.org/jira/browse/ARROW-17115 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > The new swiss join assumes that batches are being broken according to the > morsel/batch model and it assumes those batches have, at most, 32Ki rows > (signed 16-bit indices are used in various places). > However, we are not currently slicing all of our inputs to batches this > small. This is causing conbench to fail and would likely be a problem with > any large inputs. > We should fix this by slicing batches in the engine to the appropriate > maximum size. -- This message was sent by Atlassian Jira (v8.20.10#820010)
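The fix described above — slicing batches in the engine to a maximum size before they reach the join — can be sketched as follows. The 32Ki limit comes from the signed 16-bit row indices mentioned in the issue; the helper name is illustrative:

```python
# Sketch: break a large batch into (offset, length) slices so no slice
# exceeds max_rows. 32Ki = 32 * 1024 rows, per the issue text.
MAX_ROWS = 32 * 1024

def slice_batch(num_rows, max_rows=MAX_ROWS):
    """Yield (offset, length) slices covering a batch of num_rows rows."""
    offset = 0
    while offset < num_rows:
        length = min(max_rows, num_rows - offset)
        yield offset, length
        offset += length

slices = list(slice_batch(100_000))
print(slices[:2])  # [(0, 32768), (32768, 32768)]
```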
[jira] [Created] (ARROW-17189) [Python][Docs] Nightly build instructions install release version
Todd Farmer created ARROW-17189: --- Summary: [Python][Docs] Nightly build instructions install release version Key: ARROW-17189 URL: https://issues.apache.org/jira/browse/ARROW-17189 Project: Apache Arrow Issue Type: Bug Components: Documentation, Python Affects Versions: 8.0.0 Reporter: Todd Farmer The [Python installation documentation|https://arrow.apache.org/docs/python/install.html] provides the following instructions to instal nightly builds of pyarrow: {quote}{{Install the development version of PyArrow from [arrow-nightlies|https://anaconda.org/arrow-nightlies/pyarrow] conda channel:}} {quote} {quote}{{conda install -c arrow-nightlies pyarrow}}{quote} The result of this seems to be installation of the release version, not a nightly build: {code:java} (test-nightlies) todd@pop-os:~/arrow/docs$ python -c "import pyarrow; pyarrow.show_versions()" Traceback (most recent call last): File "", line 1, in ModuleNotFoundError: No module named 'pyarrow' (test-nightlies) todd@pop-os:~/arrow/docs$ conda install -c arrow-nightlies pyarrow Collecting package metadata (current_repodata.json): done Solving environment: done## Package Plan ## environment location: /home/todd/miniconda3/envs/test-nightlies added / updated specs: - pyarrow The following NEW packages will be INSTALLED: abseil-cpp pkgs/main/linux-64::abseil-cpp-20211102.0-hd4dd3e8_0 arrow-cpp pkgs/main/linux-64::arrow-cpp-8.0.0-py310h3098874_0 aws-c-common pkgs/main/linux-64::aws-c-common-0.4.57-he6710b0_1 aws-c-event-stream pkgs/main/linux-64::aws-c-event-stream-0.1.6-h2531618_5 aws-checksums pkgs/main/linux-64::aws-checksums-0.1.9-he6710b0_0 aws-sdk-cpp pkgs/main/linux-64::aws-sdk-cpp-1.8.185-hce553d0_0 blas pkgs/main/linux-64::blas-1.0-mkl boost-cpp pkgs/main/linux-64::boost-cpp-1.73.0-h7f8727e_12 brotli pkgs/main/linux-64::brotli-1.0.9-he6710b0_2 c-ares pkgs/main/linux-64::c-ares-1.18.1-h7f8727e_0 gflags pkgs/main/linux-64::gflags-2.2.2-he6710b0_0 glog pkgs/main/linux-64::glog-0.5.0-h2531618_0 grpc-cpp 
pkgs/main/linux-64::grpc-cpp-1.46.1-h33aed49_0 icu pkgs/main/linux-64::icu-58.2-he6710b0_3 intel-openmp pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561 krb5 pkgs/main/linux-64::krb5-1.19.2-hac12032_0 libboost pkgs/main/linux-64::libboost-1.73.0-h28710b8_12 libcurl pkgs/main/linux-64::libcurl-7.82.0-h0b77cf5_0 libedit pkgs/main/linux-64::libedit-3.1.20210910-h7f8727e_0 libev pkgs/main/linux-64::libev-4.33-h7f8727e_1 libevent pkgs/main/linux-64::libevent-2.1.12-h8f2d780_0 libnghttp2 pkgs/main/linux-64::libnghttp2-1.46.0-hce63b2e_0 libprotobuf pkgs/main/linux-64::libprotobuf-3.20.1-h4ff587b_0 libssh2 pkgs/main/linux-64::libssh2-1.10.0-h8f2d780_0 libthrift pkgs/main/linux-64::libthrift-0.15.0-hcc01f38_0 lz4-c pkgs/main/linux-64::lz4-c-1.9.3-h295c915_1 mkl pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640 mkl-service pkgs/main/linux-64::mkl-service-2.4.0-py310h7f8727e_0 mkl_fft pkgs/main/linux-64::mkl_fft-1.3.1-py310hd6ae3a3_0 mkl_random pkgs/main/linux-64::mkl_random-1.2.2-py310h00e6091_0 numpy pkgs/main/linux-64::numpy-1.22.3-py310hfa59a62_0 numpy-base pkgs/main/linux-64::numpy-base-1.22.3-py310h9585f30_0 orc pkgs/main/linux-64::orc-1.7.4-h07ed6aa_0 pyarrow pkgs/main/linux-64::pyarrow-8.0.0-py310h468efa6_0 re2 pkgs/main/linux-64::re2-2022.04.01-h295c915_0 snappy pkgs/main/linux-64::snappy-1.1.9-h295c915_0 utf8proc pkgs/main/linux-64::utf8proc-2.6.1-h27cfd23_0 zstd pkgs/main/linux-64::zstd-1.5.2-ha4553b6_0 Proceed ([y]/n)? 
yPreparing transaction: done Verifying transaction: done Executing transaction: done (test-nightlies) todd@pop-os:~/arrow/docs$ python -c "import pyarrow; pyarrow.show_versions()" pyarrow version info Package kind : not indicated Arrow C++ library version : 8.0.0 Arrow C++ compiler : GNU 11.2.0 Arrow C++ compiler flags : -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/todd/miniconda3/envs/test-nightlies/include -fdebug-prefix-map=/opt/conda/conda-bld/arrow-cpp_1657131305338/work=/usr/local/src/conda/arrow-cpp-8.0.0 -fdebug-prefix-map=/home/todd/miniconda3/envs/test-nightlies=/usr/local/src/conda-prefix -fdiagnostics-color=always -O3 -DNDEBUG Arrow C++ git revision : Arrow C++ git description : Arrow C++ build ty
[jira] [Commented] (ARROW-17189) [Python][Docs] Nightly build instructions install release version
[ https://issues.apache.org/jira/browse/ARROW-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570204#comment-17570204 ] Todd Farmer commented on ARROW-17189: - It's worth noting that the [instructions on anaconda.org|https://anaconda.org/arrow-nightlies/repo/installers?type=conda&label=main] differ, but are not successful: {code:java}
(test-nightlies) todd@pop-os:~/arrow/docs$ conda install --channel "arrow-nightlies" package
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - package

Current channels:

  - https://conda.anaconda.org/arrow-nightlies/linux-64
  - https://conda.anaconda.org/arrow-nightlies/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're looking for, navigate to https://anaconda.org and use the search bar at the top of the page.

(test-nightlies) todd@pop-os:~/arrow/docs$ {code} It may be expected that experienced conda users will know how to work around this; that assumption does not apply to me. 
;) > [Python][Docs] Nightly build instructions install release version > - > > Key: ARROW-17189 > URL: https://issues.apache.org/jira/browse/ARROW-17189 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Affects Versions: 8.0.0 >Reporter: Todd Farmer >Priority: Minor > > The [Python installation > documentation|https://arrow.apache.org/docs/python/install.html] provides the > following instructions to instal nightly builds of pyarrow: > {quote}{{Install the development version of PyArrow from > [arrow-nightlies|https://anaconda.org/arrow-nightlies/pyarrow] conda > channel:}} > {quote} > {quote}{{conda install -c arrow-nightlies pyarrow}}{quote} > The result of this seems to be installation of the release version, not a > nightly build: > {code:java} > (test-nightlies) todd@pop-os:~/arrow/docs$ python -c "import pyarrow; > pyarrow.show_versions()" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pyarrow' > (test-nightlies) todd@pop-os:~/arrow/docs$ conda install -c arrow-nightlies > pyarrow > Collecting package metadata (current_repodata.json): done > Solving environment: done## Package Plan ## environment location: > /home/todd/miniconda3/envs/test-nightlies added / updated specs: > - pyarrow > The following NEW packages will be INSTALLED: abseil-cpp > pkgs/main/linux-64::abseil-cpp-20211102.0-hd4dd3e8_0 > arrow-cpp pkgs/main/linux-64::arrow-cpp-8.0.0-py310h3098874_0 > aws-c-common pkgs/main/linux-64::aws-c-common-0.4.57-he6710b0_1 > aws-c-event-stream pkgs/main/linux-64::aws-c-event-stream-0.1.6-h2531618_5 > aws-checksums pkgs/main/linux-64::aws-checksums-0.1.9-he6710b0_0 > aws-sdk-cpp pkgs/main/linux-64::aws-sdk-cpp-1.8.185-hce553d0_0 > blas pkgs/main/linux-64::blas-1.0-mkl > boost-cpp pkgs/main/linux-64::boost-cpp-1.73.0-h7f8727e_12 > brotli pkgs/main/linux-64::brotli-1.0.9-he6710b0_2 > c-ares pkgs/main/linux-64::c-ares-1.18.1-h7f8727e_0 > gflags 
pkgs/main/linux-64::gflags-2.2.2-he6710b0_0 > glog pkgs/main/linux-64::glog-0.5.0-h2531618_0 > grpc-cpp pkgs/main/linux-64::grpc-cpp-1.46.1-h33aed49_0 > icu pkgs/main/linux-64::icu-58.2-he6710b0_3 > intel-openmp pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561 > krb5 pkgs/main/linux-64::krb5-1.19.2-hac12032_0 > libboost pkgs/main/linux-64::libboost-1.73.0-h28710b8_12 > libcurl pkgs/main/linux-64::libcurl-7.82.0-h0b77cf5_0 > libedit pkgs/main/linux-64::libedit-3.1.20210910-h7f8727e_0 > libev pkgs/main/linux-64::libev-4.33-h7f8727e_1 > libevent pkgs/main/linux-64::libevent-2.1.12-h8f2d780_0 > libnghttp2 pkgs/main/linux-64::libnghttp2-1.46.0-hce63b2e_0 > libprotobuf pkgs/main/linux-64::libprotobuf-3.20.1-h4ff587b_0 > libssh2 pkgs/main/linux-64::libssh2-1.10.0-h8f2d780_0 > libthrift pkgs/main/linux-64::libthrift-0.15.0-hcc01f38_0 > lz4-c pkgs/main/linux-64::lz4-c-1.9.3-h295c915_1 > mkl pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640 > mkl-service pkgs/mai
[jira] [Updated] (ARROW-16692) [C++] StackOverflow in merge generator causes segmentation fault in scan
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-16692: --- Labels: pull-request-available (was: ) > [C++] StackOverflow in merge generator causes segmentation fault in scan > > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Attachments: backtrace.txt > > Time Spent: 10m > Remaining Estimate: 0h > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16692) [C++] StackOverflow in merge generator causes segmentation fault in scan
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570214#comment-17570214 ] Weston Pace commented on ARROW-16692:
-
So the trigger seems to be a long sequence of "empty" files. Either the files are truly empty or (I think) a pushdown filter of some kind eliminated all of the rows in the file. This aligns with [~jonkeane]'s reproducer, especially the "One thing that might be important is: pickup_location_id is all NAs / nulls in the first 8 years of the data or so." part.

The merge generator logic roughly boils down to...

{code}
def get_next_batch():
    if current_file is None:
        current_file = get_next_file()
    return get_next_batch_from_file(current_file)

def get_next_batch_from_file(file):
    batch = file.read_batch()
    if not batch:
        current_file = None
        return get_next_batch()
    return batch
{code}

The new code looks something like...

{code}
def get_next_batch():
    while True:
        if current_file is None:
            current_file = get_next_file()
        batch = get_next_batch_from_file(current_file)
        if batch:
            return batch

def get_next_batch_from_file(file):
    batch = file.read_batch()
    if not batch:
        current_file = None
        return None
    return batch
{code}

However, because this is all async, the actual code change is significantly messier. Sometimes we call {{get_next_batch_from_file}} directly instead of going through {{get_next_batch}}, so we need a new flag to distinguish between the two cases:

{code}
def get_next_batch_from_file(file, recursive):
    batch = file.read_batch()
    if not batch:
        current_file = None
        if recursive:
            return None
        return get_next_batch()
    return batch
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
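The pseudocode above can be exercised as a self-contained Python sketch. The class and helper names here are illustrative, not Arrow's actual C++ types; each "file" is modeled as a list of batches so an empty list stands in for a file whose rows were all filtered out. The recursive formulation grows the call stack by two frames per exhausted file, while the loop-based one runs in constant stack depth:

```python
class RecursiveMerge:
    """Recursive style: two extra stack frames per exhausted file."""
    def __init__(self, files):
        self.files = iter(files)   # each "file" is a list of batches
        self.current = None

    def get_next_batch(self):
        if self.current is None:
            self.current = next(self.files, None)
            if self.current is None:
                return None        # all files exhausted
        return self._next_from_file(self.current)

    def _next_from_file(self, file):
        if file:
            return file.pop(0)
        self.current = None
        return self.get_next_batch()   # recursion: overflows on a long empty run


class IterativeMerge:
    """Loop style: constant stack depth regardless of empty-file runs."""
    def __init__(self, files):
        self.files = iter(files)
        self.current = None

    def get_next_batch(self):
        while True:
            if self.current is None:
                self.current = next(self.files, None)
                if self.current is None:
                    return None    # all files exhausted
            batch = self._next_from_file(self.current)
            if batch is not None:
                return batch

    def _next_from_file(self, file):
        if file:
            return file.pop(0)
        self.current = None
        return None                # signal "try the next file"


# 50,000 empty files followed by one real batch: the recursive version
# exceeds Python's default recursion limit here; the iterative one is fine.
files = [[] for _ in range(50_000)] + [["batch-0"]]
print(IterativeMerge(files).get_next_batch())   # -> batch-0
```

The same structural fix (replacing self-recursion with a loop) is what the comment describes, minus the async machinery and the `recursive` flag.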
[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability
[ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570217#comment-17570217 ] Weston Pace commented on ARROW-17183:
-
{blockquote}
Specifically about this case, we assume that these Fetch and Sort operations are the outermost relations in a Substrait plan, meaning the Sort or Fetch operation is called at the end of the other operations. This is not a very accurate representation. First we need to understand if this is the general case. cc Weston Pace Jeroen van Straten
{blockquote}

This is the case in TPC-H. It also tends to be the case in real-world queries, as it is rather tricky / impossible to write a mid-plan sort in SQL. Subqueries / window functions are the most likely places where you would see a mid-plan sort, and we don't support either at the moment.

{blockquote}
Another approach is that we define a sink node which can execute a function that does the expected operation. Some of the defined sink nodes (SelectK, Sort) have a function called `DoFinish`. We should be able to call a custom function within this call. So on the Substrait end, when we extract the plan, we can write the required `std::function` that would be an option for this custom sink node, assume a table as input, and write the logic. This way we don't have to introduce new nodes. And what if there are different capabilities users need and Acero has a limitation: can we always keep adding nodes to fulfil that? I am not so sure. This is just a high-level thought process.
{blockquote}

This special sink node would have to collect all of the data in memory first.

{blockquote}
I have implemented a SortAndFetch node which can perform the fetch followed by a sort just by following what is being done in the Sort and SelectK nodes. But I am not exactly sure any of these approaches is optimized or the best way to solve the problem.
{blockquote}

The biggest general problem here is that a top-k node should not have to collect all data in memory. It does have to scan all data, but it should be able to throw away data that is obviously larger than K. A SortAndFetch node should also not have to collect all data in memory. Our current implementation does, so what you've described is no worse than our current situation. Yet it is definitely something we should improve at some point. There is ARROW-14202 to improve the top-k node; we could improve SortAndFetch at that time as well. CC [~ArianaVillegas] as this might be something she wants to consider when addressing ARROW-14202 (i.e. we don't just need top-k, we need top-k-skipping-m).

> [C++] Adding ExecNode with Sort and Fetch capability
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> In Substrait integrations with Acero, a required piece of functionality is
> the ability to fetch records, sorted or unsorted.
> The Fetch operation is defined as selecting `K` records with an offset:
> for instance, pick 10 records skipping the first 5. Here we can define this
> as a Slice operation, and records can easily be extracted in a sink node.
> The Sort-and-Fetch operation applies when we need to execute a Fetch on
> sorted data. The main issue is that we cannot have a sort node followed by a
> fetch, because all existing node definitions supporting sort are based on
> sink nodes, and there cannot be a node after a sink. This functionality
> therefore has to take place in a single node.
> This is not a perfect solution for fetch and sort, but one way to do it is
> to define a sink node where the records are sorted and then a set of items
> is fetched.
> Another dilemma is what if a sort is followed by a fetch. In that case,
> there has to be a flag to control the order of the operations.
> The objective of this ticket is to discuss a viable, efficient solution and
> introduce new nodes or a method to execute such logic.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
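The bounded-memory behaviour described above ("throw away data that is obviously larger than K") can be illustrated with a short Python sketch of top-k-skipping-m. This is only an illustration of the idea using the standard library's `heapq`, not Acero's implementation: it keeps at most k+m rows in a max-heap (stored with negated keys), evicts anything larger than the current worst retained row, then drops the first m of the sorted survivors.

```python
import heapq

def top_k_skipping_m(stream, k, m, key=lambda x: x):
    """Return rows ranked m .. m+k-1 in ascending key order,
    holding at most k + m rows in memory at any time."""
    bound = k + m
    heap = []  # max-heap via negated keys: heap[0] holds the largest kept key
    for i, row in enumerate(stream):
        item = (-key(row), i, row)  # i breaks ties so rows are never compared
        if len(heap) < bound:
            heapq.heappush(heap, item)
        elif item > heap[0]:        # smaller key => larger negated key
            heapq.heapreplace(heap, item)   # evict the current largest
    # Sort survivors ascending by key (then arrival order), skip the first m.
    smallest = sorted(heap, key=lambda t: (-t[0], t[1]))
    return [row for _, _, row in smallest[m:m + k]]

print(top_k_skipping_m(iter([5, 1, 9, 3, 7, 2, 8]), k=2, m=1))   # -> [2, 3]
```

A future SortAndFetch / top-k node along the lines of ARROW-14202 would presumably apply the same bound in a streaming, parallel fashion rather than this single-pass scalar loop.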
[jira] [Comment Edited] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability
[ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570217#comment-17570217 ] Weston Pace edited comment on ARROW-17183 at 7/23/22 12:23 AM:
---
{quote}
Specifically about this case, we assume that these Fetch and Sort operations are the outermost relations in a Substrait plan, meaning the Sort or Fetch operation is called at the end of the other operations. This is not a very accurate representation. First we need to understand if this is the general case. cc Weston Pace Jeroen van Straten
{quote}

This is the case in TPC-H. It also tends to be the case in real-world queries, as it is rather tricky / impossible to write a mid-plan sort in SQL. Subqueries / window functions are the most likely places where you would see a mid-plan sort, and we don't support either at the moment.

{quote}
Another approach is that we define a sink node which can execute a function that does the expected operation. Some of the defined sink nodes (SelectK, Sort) have a function called `DoFinish`. We should be able to call a custom function within this call. So on the Substrait end, when we extract the plan, we can write the required `std::function` that would be an option for this custom sink node, assume a table as input, and write the logic. This way we don't have to introduce new nodes. And what if there are different capabilities users need and Acero has a limitation: can we always keep adding nodes to fulfil that? I am not so sure. This is just a high-level thought process.
{quote}

This special sink node would have to collect all of the data in memory first.

{quote}
I have implemented a SortAndFetch node which can perform the fetch followed by a sort just by following what is being done in the Sort and SelectK nodes. But I am not exactly sure any of these approaches is optimized or the best way to solve the problem.
{quote}

The biggest general problem here is that a top-k node should not have to collect all data in memory. It does have to scan all data, but it should be able to throw away data that is obviously larger than K. A SortAndFetch node should also not have to collect all data in memory. Our current implementation does, so what you've described is no worse than our current situation. Yet it is definitely something we should improve at some point. There is ARROW-14202 to improve the top-k node; we could improve SortAndFetch at that time as well. CC [~ArianaVillegas] as this might be something she wants to consider when addressing ARROW-14202 (i.e. we don't just need top-k, we need top-k-skipping-m).
[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability
[ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570221#comment-17570221 ] Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--
This is a clear description of the short-term goals. I will go ahead and create two PRs, for the Fetch and SortAndFetch nodes.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17190) [C++] Create a Fetch Node
Vibhatha Lakmal Abeykoon created ARROW-17190:

Summary: [C++] Create a Fetch Node
Key: ARROW-17190
URL: https://issues.apache.org/jira/browse/ARROW-17190
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Vibhatha Lakmal Abeykoon
Assignee: Vibhatha Lakmal Abeykoon

The goal of the fetch node is to skip some records and then select a given number of records. This will be implemented as discussed in the parent issue.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability
[ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570222#comment-17570222 ] Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--
Maybe just a fetch node with a sort capability that can be left optional.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17190) [C++] Create a Fetch Node
[ https://issues.apache.org/jira/browse/ARROW-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570223#comment-17570223 ] Vibhatha Lakmal Abeykoon commented on ARROW-17190:
--
In addition, it would be better to include sorting as an optional feature. This would avoid adding another node just to handle the sort-and-fetch operation. Here we can also provide a flag to decide the order of things: fetch then sort, or sort then fetch. The node name could be just `Fetch`.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
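The fetch semantics discussed in this ticket (skip an offset, then take a count, slicing row batches as they stream past) can be sketched as a hypothetical standalone Python generator. This is only an illustration of the operator's logic, not the actual Acero node; each batch is modeled as a plain list of rows.

```python
def fetch_batches(batches, offset, count):
    """Skip `offset` rows, then emit `count` rows, slicing each streamed
    row batch as needed; stops early once the limit is satisfied."""
    seen = 0      # rows consumed from upstream so far
    emitted = 0   # rows already passed downstream
    for batch in batches:
        lo = min(max(offset - seen, 0), len(batch))   # skip within this batch
        hi = min(len(batch), lo + (count - emitted))  # honour the limit
        seen += len(batch)
        if hi > lo:
            yield batch[lo:hi]
            emitted += hi - lo
        if emitted == count:
            return                                    # early out

# "pick 10 records skipping the first 5 elements", over 4-row batches:
batches = [list(range(i, i + 4)) for i in range(0, 20, 4)]
rows = [r for b in fetch_batches(batches, offset=5, count=10) for r in b]
print(rows)   # -> [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```

Adding the optional sort discussed above would amount to choosing whether this slicing runs before or after a sort of the accumulated input, which is why a flag controlling the order of the two steps is proposed.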