[jira] [Resolved] (ARROW-16818) [Doc][Python] Document GCS filesystem for PyArrow

2022-07-22 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim resolved ARROW-16818.
-
Resolution: Fixed

Issue resolved by pull request 13681
[https://github.com/apache/arrow/pull/13681]

> [Doc][Python] Document GCS filesystem for PyArrow
> -
>
> Key: ARROW-16818
> URL: https://issues.apache.org/jira/browse/ARROW-16818
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Rok Mihevc
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Followup to ARROW-14892.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17181) [Docs][Python] Scalar UDF Experimental Documentation

2022-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17181:
---
Labels: pull-request-available  (was: )

> [Docs][Python] Scalar UDF Experimental Documentation
> 
>
> Key: ARROW-17181
> URL: https://issues.apache.org/jira/browse/ARROW-17181
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Documentation, Python
>Affects Versions: 9.0.0
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> At the moment the existing Scalar UDF usage is not documented. There will be 
> a final version of documentation update once other features are integrated. 
> But to support the users and developers, the existing content needs to be 
> documented. 





[jira] [Updated] (ARROW-17159) [C++][JAVA] Dataset: Support reading from fixed offset of a file for Parquet format

2022-07-22 Thread Hongze Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongze Zhang updated ARROW-17159:
-
Description: This adds the properties start_offset_ and length_ to FileSource 
and should be functional for the Parquet dataset format. The Java and C++ 
dataset APIs are supported at this time.  (was: With that, we can use substrait 
plan ReadRel_LocalFiles_FileOrFiles.start() and length() to pushdown scan 
filter)

> [C++][JAVA] Dataset: Support reading from fixed offset of a file for Parquet 
> format
> ---
>
> Key: ARROW-17159
> URL: https://issues.apache.org/jira/browse/ARROW-17159
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Affects Versions: 9.0.0
>Reporter: Jin Chengcheng
>Assignee: Jin Chengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This adds the properties start_offset_ and length_ to FileSource and should 
> be functional for the Parquet dataset format. The Java and C++ dataset APIs 
> are supported at this time.





[jira] [Resolved] (ARROW-17066) [C++][Substrait] "ignore_unknown_fields" should be specified when converting JSON to binary

2022-07-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-17066.

Resolution: Fixed

Issue resolved by pull request 13605
[https://github.com/apache/arrow/pull/13605]

> [C++][Substrait] "ignore_unknown_fields" should be specified when converting 
> JSON to binary
> ---
>
> Key: ARROW-17066
> URL: https://issues.apache.org/jira/browse/ARROW-17066
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Richard Tia
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available, substrait
> Fix For: 9.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.json_util#JsonParseOptions]
>  
> When converting a Substrait JSON plan to binary, unknown fields are likely to 
> be present, since Substrait is evolving on a weekly basis. 
> ignore_unknown_fields should be specified when doing this conversion.
>  
> This is resulting in frequent errors similar to this:
> {code:java}
> E   pyarrow.lib.ArrowInvalid: JsonToBinaryStream returned 
> INVALID_ARGUMENT:(relations[0].root.input.sort.input.aggregate.measures[0].measure)
>  arguments: Cannot find field.
> pyarrow/error.pxi:100: ArrowInvalid {code}
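The requested behaviour, skipping fields the schema does not declare instead of failing, can be sketched in plain Python. This is a toy stand-in with a hypothetical field list, not the Arrow or protobuf implementation; the actual fix sets `ignore_unknown_fields` on protobuf's `JsonParseOptions`:

```python
import json

# Toy, simplified stand-in for the declared top-level Substrait plan fields.
KNOWN_FIELDS = {"relations", "extensionUris", "extensions"}

def parse_lenient(text: str) -> dict:
    """Parse JSON and silently drop unknown top-level keys.

    A strict parser would instead raise on anything undeclared
    ("Cannot find field", as in the error above).
    """
    doc = json.loads(text)
    return {k: v for k, v in doc.items() if k in KNOWN_FIELDS}
```

For example, `parse_lenient('{"relations": [], "brandNewField": 1}')` keeps only the `relations` entry rather than erroring on the new field.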





[jira] [Closed] (ARROW-16304) [C++] arrow-dataset-file-parquet-test sporadic failure in appveyor job

2022-07-22 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai closed ARROW-16304.

Resolution: Cannot Reproduce

> [C++] arrow-dataset-file-parquet-test sporadic failure in appveyor job
> --
>
> Key: ARROW-16304
> URL: https://issues.apache.org/jira/browse/ARROW-16304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Weston Pace
>Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/43330908/job/yw3djjni6as253m4
> {code:bash}
> [ RUN  ] 
> TestScan/TestParquetFileFormatScan.ScanRecordBatchReaderProjected/0Threaded16b1024r
> C:/projects/arrow/cpp/src/arrow/util/future.cc:323:  Check failed: 
> !IsFutureFinished(state_) Future already marked finished
> {code}





[jira] [Created] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)
Vibhatha Lakmal Abeykoon created ARROW-17183:


 Summary: [C++] Adding ExecNode with Sort and Fetch capability
 Key: ARROW-17183
 URL: https://issues.apache.org/jira/browse/ARROW-17183
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Vibhatha Lakmal Abeykoon
Assignee: Vibhatha Lakmal Abeykoon


In Substrait integrations with Acero, one required piece of functionality is 
the ability to fetch records, both sorted and unsorted.

The Fetch operation is defined as selecting `K` records starting at an offset; 
for instance, pick 10 records, skipping the first 5. We can define this as a 
Slice operation, and the records can be easily extracted in a sink node. 

The Sort-and-Fetch operation applies when we need to execute a Fetch operation 
on sorted data. The main issue is that we cannot have a sort node followed by a 
fetch node: all existing node definitions supporting sort are based on sink 
nodes, and since a sink cannot be followed by another node, this functionality 
has to take place in a single node. 

This is not a perfect solution for fetch and sort, but one way to do it is to 
define a sink node where the records are sorted and then a set of items is 
fetched. 

Another dilemma is what happens if a sort is followed by a fetch. In that case, 
there has to be a flag controlling the order of the operations. 

The objective of this ticket is to discuss a viable, efficient solution and to 
add new nodes, or another mechanism, to execute such logic.
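The two operation orders described above give different results, which is why a flag for the order matters. A plain-Python sketch with toy rows (lists, not Acero nodes or record batches):

```python
# Toy illustration: "fetch" = take K records starting at an offset.
def fetch(rows, offset, k):
    return rows[offset:offset + k]

rows = [7, 3, 9, 1, 5]

# Sort then fetch: slice is taken from the sorted sequence.
sort_then_fetch = fetch(sorted(rows), offset=1, k=2)

# Fetch then sort: slice is taken in the original order, then sorted.
fetch_then_sort = sorted(fetch(rows, offset=1, k=2))
```

Here `sort_then_fetch` is `[3, 5]` while `fetch_then_sort` is `[3, 9]`, so a single node performing both must know which order was requested.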

 





[jira] [Commented] (ARROW-16692) [C++] StackOverflow in merge generator causes segmentation fault in scan

2022-07-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569929#comment-17569929
 ] 

Raúl Cumplido commented on ARROW-16692:
---

This is a blocker for 9.0.0. Is there something we could do to unblock it? I am 
just asking because there has not been much activity on this ticket and I am 
not sure how this is going to affect the release schedule at the moment.

> [C++] StackOverflow in merge generator causes segmentation fault in scan
> 
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
> Attachments: backtrace.txt
>
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> While working on some examples with the new, more cleaned-up taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time it ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files and constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important: 
> {{pickup_location_id}} is all NAs/nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.





[jira] [Assigned] (ARROW-13238) [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-13238:


Assignee: Vibhatha Lakmal Abeykoon  (was: Ben Kietzman)

> [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans
> --
>
> Key: ARROW-13238
> URL: https://issues.apache.org/jira/browse/ARROW-13238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> ARROW-11930 grew too large to include substitution of an ExecPlan for 
> existing scan machinery, but this still needs to happen





[jira] [Assigned] (ARROW-13238) [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vibhatha Lakmal Abeykoon reassigned ARROW-13238:


Assignee: Ben Kietzman  (was: Vibhatha Lakmal Abeykoon)

> [C++][Dataset][Compute] Substitute ExecPlan impl for dataset scans
> --
>
> Key: ARROW-13238
> URL: https://issues.apache.org/jira/browse/ARROW-13238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> ARROW-11930 grew too large to include substitution of an ExecPlan for 
> existing scan machinery, but this still needs to happen





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569933#comment-17569933
 ] 

Raúl Cumplido commented on ARROW-15678:
---

Is this still a blocker?

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569935#comment-17569935
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--

Specifically about this case, we assume that the Fetch and Sort operations are 
the outermost relations in a Substrait plan, meaning the Sort or Fetch 
operation is applied after all other operations. This may not be an entirely 
accurate representation; first we need to understand whether this is the 
general case. cc [~westonpace] [~jvanstraten] 

There are a few settings in which these two operations can be applied. 

*Sort Only*

If the query only has a Sort operation, instead of adding the 
`SinkNodeConsumer` we need to add an `OrderBySinkNodeConsumer`. 

*Fetch Only*

If the query only has a Fetch operation, we can include a node with fetch 
capability. At the moment we don't have such a node, so one may need to be 
added, where we could use logic similar to that of the `SelectK` node.

*SortAndFetch or FetchAndSort*

If the query contains both sort and fetch in a given order, there has to be a 
single sink node which can perform both operations in that order. When 
scanning the plan components, if we find a Sort we add an `OrderBySink` and 
keep adding other relations. If we then find a Fetch operation, the sink needs 
to be replaced with a SortAndFetch operation where sorting is done first and 
fetching next, and vice versa for the opposite order. 

*Another Approach:*

Another approach is to define a sink node which can execute a function that 
performs the expected operation. Some of the existing sink nodes (SelectK, 
Sort) have a function called `DoFinish`. We should be able to call a custom 
function within this call: when we extract the plan on the Substrait end, we 
can write the required `std::function`, pass it as an option to this custom 
sink node, assume a table as input, and write the logic there. This way we 
don't have to introduce new nodes. After all, if users need capabilities that 
Acero lacks, can we always keep adding nodes to fulfill them? I am not so 
sure. This is just a high-level thought process.

I have implemented a _SortAndFetch_ node which can perform the fetch followed 
by a sort, simply following what is done in the Sort and SelectK nodes, but I 
am not sure any of these approaches is optimal or the best way to solve the 
problem.  

This is the part I am not quite clear about: what would be the most efficient 
way to include this capability, or whether there is a better way to do it. I 
would appreciate your thoughts. 

cc [~westonpace] [~jvanstraten] [~bkietz] [~icook] 
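The custom-`DoFinish` idea can be sketched in plain Python. All names here are illustrative, not Acero's actual C++ sink interfaces: a sink buffers rows and runs a caller-supplied finishing function (sort, fetch, or both, in either order) when finished:

```python
from typing import Callable, List

class CustomSinkSketch:
    """Hypothetical sink: buffer rows, apply a user-supplied finish function."""

    def __init__(self, finish: Callable[[List[int]], List[int]]):
        self._finish = finish
        self._rows: List[int] = []

    def consume(self, batch: List[int]) -> None:
        # Accumulate incoming batches, as a sink node would.
        self._rows.extend(batch)

    def do_finish(self) -> List[int]:
        # The pluggable step: sort/fetch logic lives in the callback.
        return self._finish(self._rows)

# Sort-then-fetch as the finishing function: sort, skip 1, take 2.
sink = CustomSinkSketch(lambda rows: sorted(rows)[1:3])
sink.consume([7, 3, 9])
sink.consume([1, 5])
result = sink.do_finish()
```

With these toy batches, `result` is `[3, 5]`; swapping in a different callback changes the operation order without introducing a new node type, which is the trade-off the comment weighs.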

> [C++] Adding ExecNode with Sort and Fetch capability
> 
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In Substrait integrations with Acero, one required piece of functionality is 
> the ability to fetch records, both sorted and unsorted.
> The Fetch operation is defined as selecting `K` records starting at an 
> offset; for instance, pick 10 records, skipping the first 5. We can define 
> this as a Slice operation, and the records can be easily extracted in a sink 
> node. 
> The Sort-and-Fetch operation applies when we need to execute a Fetch 
> operation on sorted data. The main issue is that we cannot have a sort node 
> followed by a fetch node: all existing node definitions supporting sort are 
> based on sink nodes, and since a sink cannot be followed by another node, 
> this functionality has to take place in a single node. 
> This is not a perfect solution for fetch and sort, but one way to do it is to 
> define a sink node where the records are sorted and then a set of items is 
> fetched. 
> Another dilemma is what happens if a sort is followed by a fetch. In that 
> case, there has to be a flag controlling the order of the operations. 
> The objective of this ticket is to discuss a viable, efficient solution and 
> to add new nodes, or another mechanism, to execute such logic.
>  





[jira] [Assigned] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None

2022-07-22 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim reassigned ARROW-17142:
---

Assignee: Kshiteej K

> [Python] Parquet FileMetadata.equals() method segfaults when passed None
> 
>
> Key: ARROW-17142
> URL: https://issues.apache.org/jira/browse/ARROW-17142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kshiteej K
>Assignee: Kshiteej K
>Priority: Major
>  Labels: good-first-issue, pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> table = pa.table({"a": [1, 2, 3]})
> # Here metadata is None
> metadata = table.schema.metadata
> fname = "data.parquet"
> pq.write_table(table, fname)
> # Read the metadata back from the file.
> r_metadata = pq.read_metadata(fname)
> # equals() on FileMetaData segfaults when passed None
> r_metadata.equals(metadata) {code}
>  
>  





[jira] [Created] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out

2022-07-22 Thread Jira
Dragoș Moldovan-Grünfeld created ARROW-17184:


 Summary: [R] Investigate Nodes missing from ExecPlan print out
 Key: ARROW-17184
 URL: https://issues.apache.org/jira/browse/ARROW-17184
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Dragoș Moldovan-Grünfeld


Now that we can print an ExecPlan (ARROW-15016), we can investigate unexpected 
outputs. With the following chunk of code:
{code:r}
mtcars %>%
  arrow_table() %>%
  select(mpg, wt, cyl) %>% 
  filter(mpg > 20) %>%
  arrange(desc(wt)) %>%
  head(3) %>%
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:SinkNode{}
#>   1:ProjectNode{projection=[mpg, wt, cyl]}
#> 0:SourceNode{}
{code}
 * FilterNode disappears when head/tail are involved +
 * we do not have additional information regarding the OrderBySinkNode
 * the entry point is a SourceNode and not a TableSourceNode





[jira] [Commented] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out

2022-07-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569947#comment-17569947
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-17184:
--

A relevant comment - 
https://github.com/apache/arrow/pull/13541#issuecomment-1192085798

> [R] Investigate Nodes missing from ExecPlan print out
> -
>
> Key: ARROW-17184
> URL: https://issues.apache.org/jira/browse/ARROW-17184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Now that we can print an ExecPlan (ARROW-15016), we can investigate 
> unexpected outputs. With the following chunk of code:
> {code:r}
> mtcars %>%
>   arrow_table() %>%
>   select(mpg, wt, cyl) %>% 
>   filter(mpg > 20) %>%
>   arrange(desc(wt)) %>%
>   head(3) %>%
>   show_exec_plan()
> #> ExecPlan with 3 nodes:
> #> 2:SinkNode{}
> #>   1:ProjectNode{projection=[mpg, wt, cyl]}
> #> 0:SourceNode{}
> {code}
>  * FilterNode disappears when head/tail are involved +
>  * we do not have additional information regarding the OrderBySinkNode
>  * the entry point is a SourceNode and not a TableSourceNode





[jira] [Commented] (ARROW-17166) [R] [CI] Remove ENV TZ from docker files

2022-07-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569958#comment-17569958
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-17166:
--

It turns out I misinterpreted the CI output and this was an issue with OOM in 
the large memory tests. See 
[comment|https://github.com/apache/arrow/pull/13680#issuecomment-1191873953].

> [R] [CI] Remove ENV TZ from docker files
> 
>
> Key: ARROW-17166
> URL: https://issues.apache.org/jira/browse/ARROW-17166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Rok Mihevc
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: CI, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> We have noticed R CI job (AMD64 Ubuntu 20.04 R 4.2 Force-Tests true) failing 
> on master: 
> [1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547],
>  
> [2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804],
>  
> [3|https://github.com/apache/arrow/runs/7445803518?check_suite_focus=true#step:7:16305]
> with:
> {code:java}
> Start test: array uses local timezone for POSIXct without timezone
>   test-Array.R:269:3 [success]
> System has not been booted with systemd as init system (PID 1). Can't operate.
> Failed to create bus connection: Host is down
> {code}





[jira] [Updated] (ARROW-17166) [R] [CI] Exclude large memory tests from the force-tests job on CI

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17166:
-
Summary: [R] [CI] Exclude large memory tests from the force-tests job on CI 
 (was: [R] [CI] Remove ENV TZ from docker files)

> [R] [CI] Exclude large memory tests from the force-tests job on CI
> --
>
> Key: ARROW-17166
> URL: https://issues.apache.org/jira/browse/ARROW-17166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Rok Mihevc
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: CI, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> We have noticed R CI job (AMD64 Ubuntu 20.04 R 4.2 Force-Tests true) failing 
> on master: 
> [1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547],
>  
> [2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804],
>  
> [3|https://github.com/apache/arrow/runs/7445803518?check_suite_focus=true#step:7:16305]
> with:
> {code:java}
> Start test: array uses local timezone for POSIXct without timezone
>   test-Array.R:269:3 [success]
> System has not been booted with systemd as init system (PID 1). Can't operate.
> Failed to create bus connection: Host is down
> {code}





[jira] [Updated] (ARROW-17171) [C++][Gandiva] Implement NextDay Function to case-insensitive

2022-07-22 Thread Vinicius Souza Roque (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinicius Souza Roque updated ARROW-17171:
-
Summary: [C++][Gandiva] Implement NextDay Function to case-insensitive  
(was: [C++][Gandiva] Implement case-insensitive)

> [C++][Gandiva] Implement NextDay Function to case-insensitive
> -
>
> Key: ARROW-17171
> URL: https://issues.apache.org/jira/browse/ARROW-17171
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Vinicius Souza Roque
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Implementing changes to make the function case-insensitive





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569968#comment-17569968
 ] 

Antoine Pitrou commented on ARROW-15678:


Ideally it would... But there's little chance of it being fixed in time for 
9.0.0.

As I said above, the workaround should be to disable runtime SIMD optimizations 
on the affected builds. Someone has to validate that suggestion, though (i.e. 
someone who's able to reproduce this issue).

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-16444) [R] Implement user-defined scalar functions in R bindings

2022-07-22 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington resolved ARROW-16444.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13397
[https://github.com/apache/arrow/pull/13397]

> [R] Implement user-defined scalar functions in R bindings
> -
>
> Key: ARROW-16444
> URL: https://issues.apache.org/jira/browse/ARROW-16444
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 26h 10m
>  Remaining Estimate: 0h
>
> In ARROW-15639, user-defined (scalar) functions were implemented for Python. 
> In ARROW-15841 and ARROW-15168 we developed some tooling and strategies for 
> calling into R from non-R threads, so in theory we should be able to mirror 
> the Python implementation (possibly with the constraint that we have to 
> provide a way to return a {{Table}} instead of a {{RecordBatchReader}} if 
> there are any user-defined functions, otherwise we can't guarantee the 
> existence of an event loop to do the R evaluation?).





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569991#comment-17569991
 ] 

Jacob Wujciak-Jens commented on ARROW-15678:


Looking at [ARROW-15664] and this 
[PR|https://github.com/apache/arrow/pull/12364/files#diff-ca50d864d033146f9135f2fc25ae337322982dd340c6fa25b1efe9f0c02db870]
 it seems like a workaround has been implemented for Homebrew, IIUC. So this is 
still an issue, but as the real fix won't happen for 9.0.0, it shouldn't be a 
blocker anymore?

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569996#comment-17569996
 ] 

Antoine Pitrou commented on ARROW-15678:


If that was actually accepted by Homebrew then fine.

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out for nested queries

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17184:
-
Summary: [R] Investigate Nodes missing from ExecPlan print out for nested 
queries  (was: [R] Investigate Nodes missing from ExecPlan print out)

> [R] Investigate Nodes missing from ExecPlan print out for nested queries
> 
>
> Key: ARROW-17184
> URL: https://issues.apache.org/jira/browse/ARROW-17184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Now that we can print an ExecPlan (ARROW-15016), we can investigate 
> unexpected outputs. With the following chunk of code:
> {code:r}
> mtcars %>%
>   arrow_table() %>%
>   select(mpg, wt, cyl) %>% 
>   filter(mpg > 20) %>%
>   arrange(desc(wt)) %>%
>   head(3) %>%
>   show_exec_plan()
> #> ExecPlan with 3 nodes:
> #> 2:SinkNode{}
> #>   1:ProjectNode{projection=[mpg, wt, cyl]}
> #> 0:SourceNode{}
> {code}
>  * FilterNode disappears when head/tail are involved +
>  * we do not have additional information regarding the OrderBySinkNode
>  * the entry point is a SourceNode and not a TableSourceNode





[jira] [Updated] (ARROW-17184) [R] Investigate Nodes missing from ExecPlan print out for nested queries

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17184:
-
Description: 
*Update:* We currently do not print a plan for nested queries, as that would 
involve executing the innermost query/queries. That is likely the reason why, 
in the chunk below, we do not see the filter and arrange nodes.

==

*Original description:*

Now that we can print an ExecPlan (ARROW-15016), we can investigate unexpected 
outputs. With the following chunk of code:
{code:r}
mtcars %>%
  arrow_table() %>%
  select(mpg, wt, cyl) %>% 
  filter(mpg > 20) %>%
  arrange(desc(wt)) %>%
  head(3) %>%
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:SinkNode{}
#>   1:ProjectNode{projection=[mpg, wt, cyl]}
#> 0:SourceNode{}
{code}
 * FilterNode disappears when head/tail are involved +
 * we do not have additional information regarding the OrderBySinkNode
 * the entry point is a SourceNode and not a TableSourceNode

  was:
Now that we can print an ExecPlan (ARROW-15016), we can investigate unexpected 
outputs. With the following chunk of code:
{code:r}
mtcars %>%
  arrow_table() %>%
  select(mpg, wt, cyl) %>% 
  filter(mpg > 20) %>%
  arrange(desc(wt)) %>%
  head(3) %>%
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:SinkNode{}
#>   1:ProjectNode{projection=[mpg, wt, cyl]}
#> 0:SourceNode{}
{code}
 * FilterNode disappears when head/tail are involved +
 * we do not have additional information regarding the OrderBySinkNode
 * the entry point is a SourceNode and not a TableSourceNode


> [R] Investigate Nodes missing from ExecPlan print out for nested queries
> 
>
> Key: ARROW-17184
> URL: https://issues.apache.org/jira/browse/ARROW-17184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> *Update:* We currently do not print a plan for nested queries, as that would 
> involve executing the innermost query or queries. That is likely why, in the 
> chunk below, we do not see the filter and arrange nodes.
> ==
> *Original description:*
> Now that we can print an ExecPlan (ARROW-15016), we can investigate 
> unexpected outputs. With the following chunk of code:
> {code:r}
> mtcars %>%
>   arrow_table() %>%
>   select(mpg, wt, cyl) %>% 
>   filter(mpg > 20) %>%
>   arrange(desc(wt)) %>%
>   head(3) %>%
>   show_exec_plan()
> #> ExecPlan with 3 nodes:
> #> 2:SinkNode{}
> #>   1:ProjectNode{projection=[mpg, wt, cyl]}
> #> 0:SourceNode{}
> {code}
>  * FilterNode disappears when head/tail are involved
>  * we do not have additional information regarding the OrderBySinkNode
>  * the entry point is a SourceNode and not a TableSourceNode





[jira] [Updated] (ARROW-16202) [C++][Parquet] WipeOutDecryptionKeys doesn't securely wipe out keys

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16202:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [C++][Parquet] WipeOutDecryptionKeys doesn't securely wipe out keys
> ---
>
> Key: ARROW-16202
> URL: https://issues.apache.org/jira/browse/ARROW-16202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Affects Versions: 7.0.0
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: 10.0.0
>
>
> {{InternalFileDecryptor::WipeOutDecryptionKeys()}} merely calls 
> {{std::string::clear}} to dispose of the decryption key contents, but that 
> method is not guaranteed to clear memory (it probably doesn't, actually).
> We should probably devise a portable wrapper function for the various 
> OS-specific memory clearing utilities.
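The direction the issue suggests — a portable wrapper over OS-specific clearing utilities such as explicit_bzero or SecureZeroMemory — boils down to overwriting the bytes in place rather than merely releasing them. A minimal Python analogue of that wipe-in-place idea (illustrative only; `wipe_key` is a hypothetical name, not an Arrow API):

```python
def wipe_key(buf: bytearray) -> None:
    """Overwrite key material in place before the buffer is discarded."""
    # Mutating every byte guarantees the secret is gone from this buffer;
    # merely dropping the reference (or calling clear() on a C++
    # std::string) gives no such guarantee.
    for i in range(len(buf)):
        buf[i] = 0

key = bytearray(b"top-secret-key")
wipe_key(key)
print(key == bytearray(len("top-secret-key")))  # True: all bytes zeroed
```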





[jira] [Updated] (ARROW-16795) [C#][Flight] Nightly verify-rc-source-csharp-macos-arm64 fails

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16795:
--
Fix Version/s: (was: 9.0.0)

> [C#][Flight] Nightly verify-rc-source-csharp-macos-arm64 fails
> --
>
> Key: ARROW-16795
> URL: https://issues.apache.org/jira/browse/ARROW-16795
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, FlightRPC
>Affects Versions: 9.0.0
>Reporter: Raúl Cumplido
>Priority: Critical
>
> The "verify-rc-source-csharp-macos-arm64" job has been failing on and off
> since ~May 18th; the issue seems to be with the Flight tests.
> {code:java}
>  Failed Apache.Arrow.Flight.Tests.FlightTests.TestGetFlightMetadata [567 ms]
>   Error Message:
>    Grpc.Core.RpcException : Status(StatusCode="Internal", Detail="Error 
> starting gRPC call. HttpRequestException: An HTTP/2 connection could not be 
> established because the server did not complete the HTTP/2 handshake.", 
> DebugException="System.Net.Http.HttpRequestException: An HTTP/2 connection 
> could not be established because the server did not complete the HTTP/2 
> handshake.
>    at 
> System.Net.Http.HttpConnectionPool.ReturnHttp2Connection(Http2Connection 
> connection, Boolean isNewConnection)
>    at 
> System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage 
> request)
>    at 
> System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object
>  s)
>    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread 
> threadPoolThread, ExecutionContext executionContext, ContextCallback 
> callback, Object state)
>    at 
> System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread
>  threadPoolThread)
>    at 
> System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecuteFromThreadPool(Thread
>  threadPoolThread)
>    at System.Threading.ThreadPoolWorkQueue.Dispatch()
>    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
>    at System.Threading.Thread.StartCallback()
> --- End of stack trace from previous location ---
>    at 
> System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken
>  cancellationToken)
>    at 
> System.Net.Http.HttpConnectionPool.GetHttp2ConnectionAsync(HttpRequestMessage 
> request, Boolean async, CancellationToken cancellationToken)
>    at 
> System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage
>  request, Boolean async, Boolean doRequestAuth, CancellationToken 
> cancellationToken)
>    at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, 
> Boolean async, CancellationToken cancellationToken)
>    at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, 
> Nullable`1 timeout)")
>   Stack Trace:
>      at Grpc.Net.Client.Internal.GrpcCall`2.GetResponseHeadersCoreAsync()
>    at 
> Apache.Arrow.Flight.Client.FlightClient.<>c.d.MoveNext() in 
> /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/csharp/src/Apache.Arrow.Flight/Client/FlightClient.cs:line
>  71
> --- End of stack trace from previous location ---
>    at Apache.Arrow.Flight.Tests.FlightTests.TestGetFlightMetadata() in 
> /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/csharp/test/Apache.Arrow.Flight.Tests/FlightTests.cs:line
>  183
> --- End of stack trace from previous location ---
>   Failed Apache.Arrow.Flight.Tests.FlightTests.TestGetSchema [108 ms]
>   Error Message:
>    Grpc.Core.RpcException : Status(StatusCode="Internal", Detail="Error 
> starting gRPC call. HttpRequestException: An HTTP/2 connection could not be 
> established because the server did not complete the HTTP/2 handshake.", 
> DebugException="System.Net.Http.HttpRequestException: An HTTP/2 connection 
> could not be established because the server did not complete the HTTP/2 
> handshake.
>    at 
> System.Net.Http.HttpConnectionPool.ReturnHttp2Connection(Http2Connection 
> connection, Boolean isNewConnection)
>    at 
> System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage 
> request)
>    at 
> System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine&
>  stateMachine)
>    at 
> System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage 
> request)
>    at 
> System.Net.Http.HttpConnectionPool.<>c__DisplayClass78_0.b__0()
>    at System.Threading.Tasks.Task`1.InnerInvoke()
>    at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
>    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread 
> threadPoolThread, ExecutionContext executionContext, ContextCallback 
> callback, 

[jira] [Updated] (ARROW-16919) [C++] Flight integration tests fail on verify rc nightly on linux amd64

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16919:
--
Fix Version/s: (was: 9.0.0)

> [C++] Flight integration tests fail on verify rc nightly on linux amd64
> ---
>
> Key: ARROW-16919
> URL: https://issues.apache.org/jira/browse/ARROW-16919
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, FlightRPC
>Reporter: Raúl Cumplido
>Priority: Critical
>  Labels: Nightly, pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Some of our nightly builds to verify the release are failing:
> - [verify-rc-source-integration-linux-almalinux-8-amd64|https://github.com/ursacomputing/crossbow/runs/7073206980?check_suite_focus=true]
> - [verify-rc-source-integration-linux-ubuntu-18.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073217433?check_suite_focus=true]
> - [verify-rc-source-integration-linux-ubuntu-20.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073210299?check_suite_focus=true]
> - [verify-rc-source-integration-linux-ubuntu-22.04-amd64|https://github.com/ursacomputing/crossbow/runs/7073273051?check_suite_focus=true]
> with the following:
> {code:java}
>  # FAILURES #
> FAILED TEST: middleware C++ producing,  C++ consuming
> 1 failures
>   File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd
>     output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
>   File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
>     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
>   File "/usr/lib/python3.8/subprocess.py", line 512, in run
>     raise CalledProcessError(retcode, process.args,
> subprocess.CalledProcessError: Command 
> '['/tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client', 
> '-host', 'localhost', '-port=36719', '-scenario', 'middleware']' died with 
> .
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/arrow/dev/archery/archery/integration/runner.py", line 379, in 
> _run_flight_test_case
>     consumer.flight_request(port, **client_args)
>   File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 134, in 
> flight_request
>     run_cmd(cmd)
>   File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd
>     raise RuntimeError(sio.getvalue())
> RuntimeError: Command failed: 
> /tmp/arrow-HEAD.PZocX/cpp-build/release/flight-test-integration-client -host 
> localhost -port=36719 -scenario middleware
> With output:
> --
> Headers received successfully on failing call.
> Headers received successfully on passing call.
> free(): double free detected in tcache 2 {code}





[jira] [Updated] (ARROW-16727) [C++] Bump version of bundled AWS SDK

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16727:
--
Fix Version/s: 10.0.0

> [C++] Bump version of bundled AWS SDK
> -
>
> Key: ARROW-16727
> URL: https://issues.apache.org/jira/browse/ARROW-16727
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 10.0.0
>
>
> The latest version on the 1.8 line is 1.8.186.
> We could also try to switch to the 1.9 line, but there were blocking issues 
> last time we tried.
> Note we should bump the dependent AWS library versions at the same time.





[jira] [Updated] (ARROW-16724) [C++] Bump versions of bundled dependencies

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16724:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [C++] Bump versions of bundled dependencies
> ---
>
> Key: ARROW-16724
> URL: https://issues.apache.org/jira/browse/ARROW-16724
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: 10.0.0
>
>
> We should bump bundled dependencies to their latest respective versions 
> before 9.0.0.





[jira] [Updated] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16339:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to 
> Arrow Schema metadata
> -
>
> Key: ARROW-16339
> URL: https://issues.apache.org/jira/browse/ARROW-16339
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Reporter: Joris Van den Bossche
>Priority: Critical
> Fix For: 10.0.0
>
>
> Context: I ran into this issue when reading Parquet files created by GDAL 
> (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
> writes files that have custom key_value_metadata, but without storing 
> ARROW:schema in those metadata (cc [~paleolimbot]).
> —
> Both in reading and writing files, I expected that we would map Arrow 
> {{Schema::metadata}} to Parquet {{FileMetaData::key_value_metadata}}. 
> But apparently this doesn't (always) happen out of the box, and only happens 
> through the "ARROW:schema" field (which stores the original Arrow schema, and 
> thus the metadata stored in this schema).
> For example, when writing a Table with schema metadata, this is not stored 
> directly in the Parquet FileMetaData (code below is using branch from 
> ARROW-16337 to have the {{store_schema}} keyword):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
> pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
> pq.write_table(table, "test_metadata_without_arrow_schema.parquet", 
> store_schema=False)
> # original schema has metadata
> >>> table.schema
> a: int64
> -- schema metadata --
> key: 'value'
> # reading back only has the metadata in case we stored ARROW:schema
> >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
> a: int64
> -- schema metadata --
> key: 'value'
> # and not if ARROW:schema is absent
> >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
> a: int64
> {code}
> It seems that if we store the ARROW:schema, we _also_ store the schema 
> metadata separately. But if {{store_schema}} is False, we also stop writing 
> those metadata (not fully sure if this is the intended behaviour, and that's 
> the reason for the above output):
> {code:python}
> # when storing the ARROW:schema, we ALSO store key:value metadata
> >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
> {b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...',
>  b'key': b'value'}
> # when not storing the schema, we also don't store the key:value
> >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata 
> >>> is None
> True
> {code}
> On the reading side, it seems that we generally do read custom key/value 
> metadata into schema metadata. We don't have the pyarrow APIs at the moment 
> to create such a file (given the above), but with a small patch I could 
> create such a file:
> {code:python}
> # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
> >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
> {b'key': b'value'}
> # this metadata is now correctly mapped to the Arrow schema metadata
> >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
> a: int64
> -- schema metadata --
> key: 'value'
> {code}
> But if you have a file that has both custom key/value metadata and an 
> "ARROW:schema" key, we actually ignore the custom keys, and only look at the 
> "ARROW:schema" one. 
> This was the case that I ran into with GDAL, where I have a file with both 
> keys, but where the custom "geo" key is not also included in the serialized 
> arrow schema in the "ARROW:schema" key:
> {code:python}
> # includes both keys in the Parquet file
> >>> pq.read_metadata("test_gdal.parquet").metadata
> {b'geo': b'{"version":"0.1.0","...',
>  b'ARROW:schema': b'/3gBAAAQ...'}
> # the "geo" key is lost in the Arrow schema
> >>> pq.read_table("test_gdal.parquet").schema.metadata is None
> True
> {code}
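The read-side behaviour the reporter expected — custom key/value pairs surviving alongside the "ARROW:schema" field instead of being ignored — can be sketched as plain dictionary logic. `merge_metadata` is a hypothetical helper for illustration, not the C++ implementation:

```python
ARROW_SCHEMA_KEY = b"ARROW:schema"

def merge_metadata(file_kv, schema_kv=None):
    """Combine Parquet file-level key/value metadata with metadata
    recovered from the serialized Arrow schema, when present."""
    # Keep every custom key (e.g. b"geo") but hide the internal
    # ARROW:schema entry from the user-visible schema metadata.
    merged = {k: v for k, v in file_kv.items() if k != ARROW_SCHEMA_KEY}
    if schema_kv:
        merged.update(schema_kv)  # keys stored in the Arrow schema win
    return merged

file_kv = {b"geo": b'{"version":"0.1.0"}', ARROW_SCHEMA_KEY: b"/3gBAAAQ..."}
print(merge_metadata(file_kv))  # {b'geo': b'{"version":"0.1.0"}'}
```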





[jira] [Updated] (ARROW-17164) [C++] Expose higher-level utility to execute a kernel

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-17164:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [C++] Expose higher-level utility to execute a kernel
> -
>
> Key: ARROW-17164
> URL: https://issues.apache.org/jira/browse/ARROW-17164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 10.0.0
>
>
> Currently, the compute layer exposes several high-level facilities to execute 
> a compute function: {{CallFunction}} and {{Function::Execute}}.
> However, if you'd favor a two-step approach of first resolving the {{Kernel}} 
> for a given set of argument types, then execute the kernel, then you're 
> forced to deal with the rather cumbersome {{Kernel}} execution interface.
> It would be nice if the base {{Kernel}} class had something similar to the 
> {{Function::Execute}} method.
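The requested two-step flow — first resolve a kernel from the argument types, then execute it directly — can be shown with a toy dispatcher. Names here are illustrative; the real C++ entry points are {{CallFunction}} and {{Function::Execute}}:

```python
class Kernel:
    """Toy kernel: an input-type signature plus an implementation."""
    def __init__(self, in_types, fn):
        self.in_types = in_types
        self.fn = fn

    def execute(self, *args):
        # The convenience method the issue asks for: run this kernel
        # directly, without going back through function-level dispatch.
        return self.fn(*args)

class Function:
    def __init__(self, kernels):
        self.kernels = kernels

    def dispatch(self, arg_types):
        # Step 1: resolve the kernel for a given set of argument types.
        for kernel in self.kernels:
            if kernel.in_types == arg_types:
                return kernel
        raise TypeError(f"no kernel for {arg_types}")

add = Function([Kernel(("int64", "int64"), lambda a, b: a + b)])
kernel = add.dispatch(("int64", "int64"))  # resolve once...
print(kernel.execute(2, 3))                # ...execute many times: 5
```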





[jira] [Updated] (ARROW-17170) [C++][Docs] Research Documentation Formats

2022-07-22 Thread Kae Suarez (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kae Suarez updated ARROW-17170:
---
Summary: [C++][Docs] Research Documentation Formats  (was: Research 
Documentation Formats)

> [C++][Docs] Research Documentation Formats
> --
>
> Key: ARROW-17170
> URL: https://issues.apache.org/jira/browse/ARROW-17170
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Documentation
>Reporter: Kae Suarez
>Assignee: Kae Suarez
>Priority: Major
>
> In order to revise the documentation, some inspiration is needed to get the 
> format right. This ticket provides a space for exploration of possible 
> inspiration for the C++ documentation – once we have some good examples 
> and/or agreement, we can move to some content creation.





[jira] [Resolved] (ARROW-17142) [Python] Parquet FileMetadata.equals() method segfaults when passed None

2022-07-22 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17142.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13658
[https://github.com/apache/arrow/pull/13658]

> [Python] Parquet FileMetadata.equals() method segfaults when passed None
> 
>
> Key: ARROW-17142
> URL: https://issues.apache.org/jira/browse/ARROW-17142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kshiteej K
>Assignee: Kshiteej K
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> table = pa.table({"a": [1, 2, 3]})
> # Here metadata is None
> metadata = table.schema.metadata
> fname = "data.parquet"
> pq.write_table(table, fname)
> # Get `metadata`.
> r_metadata = pq.read_metadata(fname)
> # Equals on Metadata segfaults when passed None
> r_metadata.equals(metadata)
> {code}
>  
>  
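The usual remedy for this class of bug is a None guard at the binding layer, before the argument reaches C++. A sketch of the pattern (class and method names are illustrative stand-ins, not pyarrow internals):

```python
class FileMetaDataSketch:
    """Stand-in for pyarrow's FileMetaData wrapper."""
    def __init__(self, created_by):
        self.created_by = created_by

    def _equals(self, other):
        # Stand-in for the C++ comparison that would dereference a
        # null pointer if `other` were None.
        return self.created_by == other.created_by

    def equals(self, other):
        # Guard first: raise a clear TypeError instead of segfaulting.
        if other is None:
            raise TypeError("other must be a FileMetaData, not None")
        return self._equals(other)

m = FileMetaDataSketch("parquet-cpp")
print(m.equals(FileMetaDataSketch("parquet-cpp")))  # True
try:
    m.equals(None)
except TypeError as exc:
    print(exc)  # other must be a FileMetaData, not None
```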





[jira] [Created] (ARROW-17185) [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPla$BuildAndShow()

2022-07-22 Thread Jira
Dragoș Moldovan-Grünfeld created ARROW-17185:


 Summary: [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow 
and ExecPla$BuildAndShow()
 Key: ARROW-17185
 URL: https://issues.apache.org/jira/browse/ARROW-17185
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, R
Affects Versions: 8.0.0
Reporter: Dragoș Moldovan-Grünfeld








[jira] [Updated] (ARROW-17185) [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17185:
-
Summary: [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and 
ExecPlan$BuildAndShow()  (was: [R] [C++] Remove duplicated code in 
ExecPlan_BuildAndShow and ExecPla$BuildAndShow())

> [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and 
> ExecPlan$BuildAndShow()
> -
>
> Key: ARROW-17185
> URL: https://issues.apache.org/jira/browse/ARROW-17185
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>






[jira] [Updated] (ARROW-17185) [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17185:
-
Description: 
When adding a print method for ExecPlans in R we chose to copy some of the code 
from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
{{ExecPlan$Run()}} to {{ExecPlan$BuildAndShow()}}.

Relevant PR: https://github.com/apache/arrow/pull/13541

> [R] [C++] Remove duplicated code in ExecPlan_BuildAndShow and 
> ExecPlan$BuildAndShow()
> -
>
> Key: ARROW-17185
> URL: https://issues.apache.org/jira/browse/ARROW-17185
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When adding a print method for ExecPlans in R we chose to copy some of the 
> code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
> {{ExecPlan$Run()}} to {{ExecPlan$BuildAndShow()}}.
> Relevant PR: https://github.com/apache/arrow/pull/13541





[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17185:
-
Component/s: C++

> [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and 
> ExecPlan$BuildAndShow()
> ---
>
> Key: ARROW-17185
> URL: https://issues.apache.org/jira/browse/ARROW-17185
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When adding a print method for ExecPlans in R we chose to copy some of the 
> code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
> {{ExecPlan$Run()}} to {{ExecPlan$BuildAndShow()}}.
> Relevant PR: https://github.com/apache/arrow/pull/13541





[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17185:
-
Component/s: (was: C++)

> [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and 
> ExecPlan$BuildAndShow()
> ---
>
> Key: ARROW-17185
> URL: https://issues.apache.org/jira/browse/ARROW-17185
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When adding a print method for ExecPlans in R we chose to copy some of the 
> code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
> {{ExecPlan$Run()}} to {{ExecPlan$BuildAndShow()}}.
> Relevant PR: https://github.com/apache/arrow/pull/13541





[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-17185:
-
Summary: [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and 
ExecPlan$BuildAndShow()  (was: [R] [C++] Remove duplicated code in 
ExecPlan_BuildAndShow and ExecPlan$BuildAndShow())

> [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and 
> ExecPlan$BuildAndShow()
> ---
>
> Key: ARROW-17185
> URL: https://issues.apache.org/jira/browse/ARROW-17185
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When adding a print method for ExecPlans in R we chose to copy some of the 
> code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
> {{ExecPlan$Run()}} to {{ExecPlan$BuildAndShow()}}.
> Relevant PR: https://github.com/apache/arrow/pull/13541





[jira] [Commented] (ARROW-17104) [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3

2022-07-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570088#comment-17570088
 ] 

Raúl Cumplido commented on ARROW-17104:
---

This is no longer critical, as the macOS build has been fixed by the upstream 
Homebrew package update for protobuf. I am not closing it, since we might still 
want to pick up some of the changes from the initial PR.

> [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3 
> -
>
> Key: ARROW-17104
> URL: https://issues.apache.org/jira/browse/ARROW-17104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Raúl Cumplido
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> The *AMD64 MacOS 10.15 Python 3*  job has started to fail on master with the 
> following error:
> {code:java}
>  + pytest -r s -v --pyargs pyarrow
> = test session starts 
> ==
> platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- 
> /usr/local/opt/python@3.9/bin/python3.9
> cachedir: .pytest_cache
> hypothesis profile 'default' -> 
> database=DirectoryBasedExampleDatabase('/Users/runner/work/arrow/arrow/.hypothesis/examples')
> rootdir: /Users/runner/work/arrow/arrow
> plugins: hypothesis-6.52.1, lazy-fixture-0.6.3
> collecting ... collected 0 items / 1 error
>  ERRORS 
> 
>  ERROR collecting test session 
> _
> /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127:
>  in import_module
>     return _bootstrap._gcd_import(name[level:], package, level)
> :1030: in _gcd_import
>     ???
> :1007: in _find_and_load
>     ???
> :972: in _find_and_load_unlocked
>     ???
> :228: in _call_with_frames_removed
>     ???
> :1030: in _gcd_import
>     ???
> :1007: in _find_and_load
>     ???
> :986: in _find_and_load_unlocked
>     ???
> :680: in _load_unlocked
>     ???
> :850: in exec_module
>     ???
> :228: in _call_with_frames_removed
>     ???
> /usr/local/lib/python3.9/site-packages/pyarrow/__init__.py:65: in 
>     import pyarrow.lib as _lib
> E   ImportError: 
> dlopen(/usr/local/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so,
>  2): Symbol not found: __ZN6google8protobuf8internal16InternalMetadataD1Ev
> E     Referenced from: /usr/local/lib/libarrow.900.dylib
> E     Expected in: flat namespace
> E    in /usr/local/lib/libarrow.900.dylib
>  Interrupted: 1 error during collection 
> 
> === 1 error in 5.18s 
> ===
> Error: Process completed with exit code 2. {code}
> See an example of build on arrow/master branch: 
> https://github.com/apache/arrow/runs/7385879183?check_suite_focus=true





[jira] [Updated] (ARROW-17104) [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-17104:
--
Fix Version/s: 10.0.0
   (was: 9.0.0)

> [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3 
> -
>
> Key: ARROW-17104
> URL: https://issues.apache.org/jira/browse/ARROW-17104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Raúl Cumplido
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> The *AMD64 MacOS 10.15 Python 3*  job has started to fail on master with the 
> following error:
> {code:java}
>  + pytest -r s -v --pyargs pyarrow
> = test session starts 
> ==
> platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- 
> /usr/local/opt/python@3.9/bin/python3.9
> cachedir: .pytest_cache
> hypothesis profile 'default' -> 
> database=DirectoryBasedExampleDatabase('/Users/runner/work/arrow/arrow/.hypothesis/examples')
> rootdir: /Users/runner/work/arrow/arrow
> plugins: hypothesis-6.52.1, lazy-fixture-0.6.3
> collecting ... collected 0 items / 1 error
>  ERRORS 
> 
>  ERROR collecting test session 
> _
> /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127:
>  in import_module
>     return _bootstrap._gcd_import(name[level:], package, level)
> :1030: in _gcd_import
>     ???
> :1007: in _find_and_load
>     ???
> :972: in _find_and_load_unlocked
>     ???
> :228: in _call_with_frames_removed
>     ???
> :1030: in _gcd_import
>     ???
> :1007: in _find_and_load
>     ???
> :986: in _find_and_load_unlocked
>     ???
> :680: in _load_unlocked
>     ???
> :850: in exec_module
>     ???
> :228: in _call_with_frames_removed
>     ???
> /usr/local/lib/python3.9/site-packages/pyarrow/__init__.py:65: in 
>     import pyarrow.lib as _lib
> E   ImportError: 
> dlopen(/usr/local/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so,
>  2): Symbol not found: __ZN6google8protobuf8internal16InternalMetadataD1Ev
> E     Referenced from: /usr/local/lib/libarrow.900.dylib
> E     Expected in: flat namespace
> E    in /usr/local/lib/libarrow.900.dylib
>  Interrupted: 1 error during collection 
> 
> === 1 error in 5.18s 
> ===
> Error: Process completed with exit code 2. {code}
> See an example of build on arrow/master branch: 
> https://github.com/apache/arrow/runs/7385879183?check_suite_focus=true





[jira] [Updated] (ARROW-17104) [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3

2022-07-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-17104:
--
Priority: Major  (was: Critical)

> [CI][Python] Pyarrow cannot be imported on CI job AMD64 MacOS 10.15 Python 3 
> -
>
> Key: ARROW-17104
> URL: https://issues.apache.org/jira/browse/ARROW-17104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Raúl Cumplido
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> The *AMD64 MacOS 10.15 Python 3*  job has started to fail on master with the 
> following error:
> {code:java}
>  + pytest -r s -v --pyargs pyarrow
> = test session starts 
> ==
> platform darwin -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0 -- 
> /usr/local/opt/python@3.9/bin/python3.9
> cachedir: .pytest_cache
> hypothesis profile 'default' -> 
> database=DirectoryBasedExampleDatabase('/Users/runner/work/arrow/arrow/.hypothesis/examples')
> rootdir: /Users/runner/work/arrow/arrow
> plugins: hypothesis-6.52.1, lazy-fixture-0.6.3
> collecting ... collected 0 items / 1 error
>  ERRORS 
> 
>  ERROR collecting test session 
> _
> /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127:
>  in import_module
>     return _bootstrap._gcd_import(name[level:], package, level)
> :1030: in _gcd_import
>     ???
> :1007: in _find_and_load
>     ???
> :972: in _find_and_load_unlocked
>     ???
> :228: in _call_with_frames_removed
>     ???
> :1030: in _gcd_import
>     ???
> :1007: in _find_and_load
>     ???
> :986: in _find_and_load_unlocked
>     ???
> :680: in _load_unlocked
>     ???
> :850: in exec_module
>     ???
> :228: in _call_with_frames_removed
>     ???
> /usr/local/lib/python3.9/site-packages/pyarrow/__init__.py:65: in <module>
>     import pyarrow.lib as _lib
> E   ImportError: 
> dlopen(/usr/local/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so,
>  2): Symbol not found: __ZN6google8protobuf8internal16InternalMetadataD1Ev
> E     Referenced from: /usr/local/lib/libarrow.900.dylib
> E     Expected in: flat namespace
> E    in /usr/local/lib/libarrow.900.dylib
>  Interrupted: 1 error during collection 
> 
> === 1 error in 5.18s 
> ===
> Error: Process completed with exit code 2. {code}
> See an example of build on arrow/master branch: 
> https://github.com/apache/arrow/runs/7385879183?check_suite_focus=true





[jira] [Commented] (ARROW-17110) [C++] Move away from C++11

2022-07-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570100#comment-17570100
 ] 

Weston Pace commented on ARROW-17110:
-

Just for consideration, is the following policy possible?

"We do not release new versions for R < 4 but we will consider backporting 
critical security issues"

I'm not sure if that would be more or less work than sprinkling more 
ifdef/checks throughout our code base.

> [C++] Move away from C++11
> --
>
> Key: ARROW-17110
> URL: https://issues.apache.org/jira/browse/ARROW-17110
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: H. Vetinari
>Priority: Major
>
> The upcoming abseil release has dropped support for C++11, so 
> {_}eventually{_}, arrow will have to follow. More details 
> [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37].
> Relatedly, when I 
> [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch 
> abseil to a newer C++ version on windows, things apparently broke in arrow 
> CI. This is because the ABI of abseil is sensitive to the C++ standard that's 
> used to compile, and google only supports a homogeneous version to compile 
> all artefacts in a stack. This creates some friction with conda-forge (where 
> the compilers are generally much newer than what arrow might be willing to 
> impose). For now, things seem to have worked out with arrow 
> [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124]
>  C\+\+11 while conda-forge moved to C\+\+17 - at least on unix, but windows 
> was not so lucky.
> Perhaps people would therefore also be interested in collaborating (or at 
> least commenting on) this 
> [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which 
> should permit more flexibility by being able to opt into given standard 
> versions also from conda-forge.





[jira] [Updated] (ARROW-17186) [C++][Docs] Verify Completeness of API Reference

2022-07-22 Thread Kae Suarez (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kae Suarez updated ARROW-17186:
---
Component/s: C++
 Documentation

> [C++][Docs] Verify Completeness of API Reference
> 
>
> Key: ARROW-17186
> URL: https://issues.apache.org/jira/browse/ARROW-17186
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Documentation
>Reporter: Kae Suarez
>Priority: Major
>
> At least the Datum class is missing methods in the API documentation, and any 
> other incompleteness needs to be found and addressed. This ticket is for 
> hunting down incompleteness, and filling in the gaps.





[jira] [Created] (ARROW-17186) [C++][Docs] Verify Completeness of API Reference

2022-07-22 Thread Kae Suarez (Jira)
Kae Suarez created ARROW-17186:
--

 Summary: [C++][Docs] Verify Completeness of API Reference
 Key: ARROW-17186
 URL: https://issues.apache.org/jira/browse/ARROW-17186
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Kae Suarez


At least the Datum class is missing methods in the API documentation, and any 
other incompleteness needs to be found and addressed. This ticket is for 
hunting down incompleteness, and filling in the gaps.
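One way to hunt for such gaps could be to diff a class's public callables against the names the reference actually documents. The sketch below is illustrative only — the `Example` class and the documented-name list are hypothetical, and this is not the Arrow project's actual doc tooling:

```python
import inspect

def undocumented_public_methods(cls, documented):
    # Collect the public callable members of `cls` and report any that
    # are missing from the list of documented names.
    actual = {name for name, _ in inspect.getmembers(cls, callable)
              if not name.startswith("_")}
    return sorted(actual - set(documented))

class Example:
    def kernel(self): pass
    def shape(self): pass

print(undocumented_public_methods(Example, ["kernel"]))  # ['shape']
```

Run against a class like Datum with the list of names present in the generated API pages, anything it prints is a candidate documentation gap.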





[jira] [Resolved] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

2022-07-22 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-16700.
-
Resolution: Fixed

Issue resolved by pull request 13518
[https://github.com/apache/arrow/pull/13518]

> [C++] [R] [Datasets] aggregates on partitioning columns
> ---
>
> Key: ARROW-16700
> URL: https://issues.apache.org/jira/browse/ARROW-16700
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Assignee: Jeroen van Straten
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 8.0.2, 9.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> When summarizing a whole dataset (without group_by) with an aggregate, and 
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # compared to what we get with dplyr
> df %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #>   n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>   
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>   
>   
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jacob Wujciak-Jens (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570126#comment-17570126
 ] 

Jacob Wujciak-Jens commented on ARROW-15678:


That was my impression: 
[issue|https://github.com/Homebrew/homebrew-core/issues/94724] and 
[PR|https://github.com/Homebrew/homebrew-core/pull/94958] in homebrew-core. 
Maybe [~jonkeane] can confirm?


> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-17187) [R] Improve lazy ALTREP implementation for String

2022-07-22 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-17187:


 Summary: [R] Improve lazy ALTREP implementation for String
 Key: ARROW-17187
 URL: https://issues.apache.org/jira/browse/ARROW-17187
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dewey Dunnington


ARROW-16578 noted that there was a high cost to looping through an ALTREP 
character vector that we created in the arrow R package. The temporary 
workaround is to materialize whenever the first element is requested, which is 
much faster than our initial implementation but is probably not necessary given 
that other ALTREP character implementations appear to not have this issue:

(Timings before merging ARROW-16578, which reduces the 5 second operation below 
to 0.05 seconds).

{code:R}
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.

df1 <- tibble::tibble(x=as.character(floor(runif(100) * 20)))
write_parquet(df1,"/tmp/test.parquet")
df2 <- read_parquet("/tmp/test.parquet")
system.time(unique(df1$x))
#>user  system elapsed 
#>   0.022   0.001   0.023
system.time(unique(df2$x))
#>user  system elapsed 
#>   4.529   0.680   5.226

# the speed is almost certainly not due to ALTREP itself
# but is probably something to do with our implementation
tf <- tempfile()
readr::write_csv(df1, tf)
df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
#> Rows: 100 Columns: 1
#> ── Column specification 

#> Delimiter: ","
#> dbl (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this 
message.
.Internal(inspect(df3$x))
#> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=100, 
materialized=F)
system.time(unique(df3$x))
#>user  system elapsed 
#>   0.127   0.001   0.128
.Internal(inspect(df3$x))
#> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=100, 
materialized=F)
{code}
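The materialize-on-first-access workaround described above can be sketched as a toy Python model of the ALTREP pattern — this is only a behavioral sketch under assumed semantics, not the arrow R package's C++ implementation:

```python
class LazyVector:
    """Toy model of an ALTREP-style lazy vector."""

    def __init__(self, source):
        self._source = source  # cheap handle to the backing storage
        self._cache = None     # filled in on materialization

    def _materialize(self):
        # Convert the whole backing source in one bulk pass and cache it.
        if self._cache is None:
            self._cache = [str(v) for v in self._source]
        return self._cache

    def __getitem__(self, i):
        # ARROW-16578's temporary workaround: materialize fully on the
        # first element access instead of converting element-by-element
        # on every access.
        return self._materialize()[i]

v = LazyVector(range(3))
print(v[0])  # prints 0; the full vector is now cached
```

The point of this ticket is that a per-element fast path (as vroom apparently has) should make the bulk materialization unnecessary.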






[jira] [Resolved] (ARROW-16578) [R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file

2022-07-22 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington resolved ARROW-16578.
--
Resolution: Fixed

Issue resolved by pull request 13415
[https://github.com/apache/arrow/pull/13415]

> [R] unique() and is.na() on a column of a tibble is much slower after writing 
> to and reading from a parquet file
> 
>
> Key: ARROW-16578
> URL: https://issues.apache.org/jira/browse/ARROW-16578
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0, 8.0.0
>Reporter: Hideaki Hayashi
>Assignee: Hideaki Hayashi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> unique() on a column of a tibble is much slower after writing to and reading 
> from a parquet file.
> Here is a reprex.
> {{df1 <- tibble::tibble(x=as.character(floor(runif(100) * 20)))}}
> {{write_parquet(df1,"/tmp/test.parquet")}}
> {{df2 <- read_parquet("/tmp/test.parquet")}}
> {{system.time(unique(df1$x))}}
> {{# Result on my late 2020 macbook pro with M1 processor:}}
> {{#   user  system elapsed }}
> {{#  0.020   0.000   0.021 }}
> {{system.time(unique(df2$x))}}
> {{#   user  system elapsed }}
> {{#  5.230   0.419   5.649 }}
>  
>  





[jira] [Updated] (ARROW-17185) [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and ExecPlan$BuildAndShow()

2022-07-22 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington updated ARROW-17185:
-
Description: 
When adding a print method for ExecPlans in R we chose to copy some of the code 
from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
{{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. 

Relevant PR: https://github.com/apache/arrow/pull/13541 (ARROW-15016)

  was:
When adding a print method for ExecPlans in R we chose to copy some of the code 
from {{ExecPlan_prepare}} to {{Exec_Plan_BuildAndShow}} and from 
{{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. 

Relevant PR: https://github.com/apache/arrow/pull/13541


> [R] [C++] Resolve code duplication in ExecPlan_BuildAndShow and 
> ExecPlan$BuildAndShow()
> ---
>
> Key: ARROW-17185
> URL: https://issues.apache.org/jira/browse/ARROW-17185
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When adding a print method for ExecPlans in R we chose to copy some of the 
> code from {{ExecPlan_prepare}} to {{ExecPlan_BuildAndShow}} and from 
> {{ExecPlan$Run()}} to {{{}ExecPlan$BuildAndShow(){}}}. 
> Relevant PR: https://github.com/apache/arrow/pull/13541 (ARROW-15016)





[jira] [Resolved] (ARROW-15016) [R] show_exec_plan() for an arrow_dplyr_query

2022-07-22 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington resolved ARROW-15016.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13541
[https://github.com/apache/arrow/pull/13541]

> [R] show_exec_plan() for an arrow_dplyr_query
> -
>
> Key: ARROW-15016
> URL: https://issues.apache.org/jira/browse/ARROW-15016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 18.5h
>  Remaining Estimate: 0h
>
> {*}Proposed approach{*}: [design 
> doc|https://docs.google.com/document/d/1Ep8aV4jDsNCCy9uv1bjWY_JF17nzHQogv0EnGJvraQI/edit#]
> *Steps*
>  * Read about ExecPlan and ExecPlan::ToString
>  ** https://issues.apache.org/jira/browse/ARROW-14233
>  ** https://issues.apache.org/jira/browse/ARROW-15138
>  ** https://issues.apache.org/jira/browse/ARROW-13785
>  * Hook up to the existing C++ ToString method for ExecPlans 
>  * Implement a {{ToString()}} method for {{ExecPlan}} R6 class
>  * Implement and document {{show_exec_plan()}}
> {*}Original description{*}:
> Now that we can print a query plan (ARROW-13785) we should wire this up in R 
> so we can see what execution plans are being put together for various queries 
> (like the TPC-H queries)





[jira] [Created] (ARROW-17188) [R] Update news for 9.0.0

2022-07-22 Thread Will Jones (Jira)
Will Jones created ARROW-17188:
--

 Summary: [R] Update news for 9.0.0
 Key: ARROW-17188
 URL: https://issues.apache.org/jira/browse/ARROW-17188
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Affects Versions: 9.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0








[jira] [Resolved] (ARROW-17121) [Gandiva][C++] Adding mask function

2022-07-22 Thread Palak Pariawala (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Palak Pariawala resolved ARROW-17121.
-
Resolution: Fixed

> [Gandiva][C++] Adding mask function
> ---
>
> Key: ARROW-17121
> URL: https://issues.apache.org/jira/browse/ARROW-17121
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Palak Pariawala
>Assignee: Palak Pariawala
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Add mask(str inp[, str uc-mask[, str lc-mask[, str num-mask]]]) function to 
> Gandiva.
> With default masking upper case letters as 'X', lower case letters as 'x' and 
> numbers as 'n'.
> Custom masking as optionally specified in parameters.
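The described semantics can be sketched in Python — a behavioral sketch of the masking rules only, since the actual function was added to Gandiva in C++:

```python
def mask(inp, uc_mask="X", lc_mask="x", num_mask="n"):
    # Replace uppercase letters, lowercase letters, and digits with the
    # corresponding mask character; pass everything else through unchanged.
    def sub(ch):
        if ch.isupper():
            return uc_mask
        if ch.islower():
            return lc_mask
        if ch.isdigit():
            return num_mask
        return ch
    return "".join(sub(ch) for ch in inp)

print(mask("Card-1234"))                  # Xxxx-nnnn  (default masks)
print(mask("Card-1234", "U", "l", "#"))   # Ulll-####  (custom masks)
```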





[jira] [Closed] (ARROW-17121) [Gandiva][C++] Adding mask function

2022-07-22 Thread Palak Pariawala (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Palak Pariawala closed ARROW-17121.
---

> [Gandiva][C++] Adding mask function
> ---
>
> Key: ARROW-17121
> URL: https://issues.apache.org/jira/browse/ARROW-17121
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Palak Pariawala
>Assignee: Palak Pariawala
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Add mask(str inp[, str uc-mask[, str lc-mask[, str num-mask]]]) function to 
> Gandiva.
> With default masking upper case letters as 'X', lower case letters as 'x' and 
> numbers as 'n'.
> Custom masking as optionally specified in parameters.





[jira] [Updated] (ARROW-17088) [R] Use `.arrow` as extension of IPC files of datasets

2022-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17088:
---
Labels: pull-request-available  (was: )

> [R] Use `.arrow` as extension of IPC files of datasets
> --
>
> Key: ARROW-17088
> URL: https://issues.apache.org/jira/browse/ARROW-17088
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: SHIMA Tatsuya
>Assignee: Kazuyuki Ura
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Related to ARROW-17072
> As noted in the following document, the recommended extension for IPC files 
> is now `.arrow`.
> > We recommend the “.arrow” extension for files created with this format.
> https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> However, currently when writing a dataset with the {{write_dataset}} 
> function, the default extension is {{.feather}} when {{feather}} is selected 
> as the format, and {{.ipc}} when {{ipc}} is selected.
> https://github.com/apache/arrow/blob/f295da4cfdcf102d9ac2d16bbca6f8342fc3e6a8/r/R/dataset-write.R#L124-L126
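The proposed default amounts to a format-to-extension mapping along these lines — illustrative Python only (the real change lives in the R package's {{write_dataset}}, and the {{feather}}/{{ipc}} entries reflect the ticket's proposal, not current behavior):

```python
def default_extension(fmt):
    # Map a dataset format name to its default file extension;
    # ".arrow" is the recommended extension for the IPC file format,
    # so both the "feather" and "ipc" format names map to it here.
    mapping = {"parquet": ".parquet", "feather": ".arrow",
               "ipc": ".arrow", "csv": ".csv"}
    return mapping[fmt]

print(default_extension("ipc"))  # .arrow
```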





[jira] [Resolved] (ARROW-16395) [R] Implement lubridate's parsers with year, month, and day, hour, minute, and second components

2022-07-22 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington resolved ARROW-16395.
--
Resolution: Fixed

Issue resolved by pull request 13627
[https://github.com/apache/arrow/pull/13627]

> [R] Implement lubridate's parsers with year, month, and day, hour, minute, 
> and second components
> 
>
> Key: ARROW-16395
> URL: https://issues.apache.org/jira/browse/ARROW-16395
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Parse date-times with year, month, and day, hour, minute, and second 
> components:
> ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() 
> mdy_h() ydm_hms() ydm_hm() ydm_h()
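For intuition, such a parser can be sketched in Python — a rough analogue of lubridate's {{ymd_hms()}} with an assumed list of separators, not the Arrow implementation:

```python
from datetime import datetime

def ymd_hms(s):
    # Try a few common year-month-day hour:minute:second layouts,
    # returning the first one that parses successfully.
    formats = ("%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%Y%m%d %H%M%S")
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"could not parse {s!r} as ymd_hms")

print(ymd_hms("2022-07-22 19:28:00"))  # 2022-07-22 19:28:00
```

The dmy/mdy/ydm variants differ only in the order of the `%d`, `%m`, `%Y` fields, and the `_hm`/`_h` variants drop the trailing time components.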





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570163#comment-17570163
 ] 

Jonathan Keane commented on ARROW-15678:


Homebrew only accepted that as a temporary workaround and has threatened to 
turn off optimizations if we don't resolve this. They haven't followed through 
yet, though.

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570164#comment-17570164
 ] 

Antoine Pitrou commented on ARROW-15678:


Note that I suggested a perhaps more acceptable workaround above.

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570163#comment-17570163
 ] 

Jonathan Keane edited comment on ARROW-15678 at 7/22/22 7:28 PM:
-

Homebrew only accepted that as a temporary workaround and has threatened to 
turn off optimizations if we don't resolve this. They haven't followed through 
yet, though. 
https://github.com/Homebrew/homebrew-core/issues/94724#issuecomment-1063031123 


was (Author: jonkeane):
Homebrew only accepted that as a temporary workaround and has threatened to 
turn off optimizations if we don't resolve this. They haven't yet followed 
through yet, though.

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17177) [C++][Docs] Re-Organize the Existing ACERO Streaming Engine Documentation

2022-07-22 Thread Aldrin Montana (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570179#comment-17570179
 ] 

Aldrin Montana commented on ARROW-17177:


Linking related JIRAs for improving Acero documentation and updating the 
overview of components and layers

> [C++][Docs] Re-Organize the Existing ACERO Streaming Engine Documentation
> -
>
> Key: ARROW-17177
> URL: https://issues.apache.org/jira/browse/ARROW-17177
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Documentation
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> The current document is too long. Creating a sub-page for each example, 
> explaining the code, and providing a better description would make it much 
> easier to read and browse. The idea is to create a sub-folder in the examples 
> called `acero` and include each example in a separate `.cc` file; this is the 
> code change. Following this, the documentation page on the website can be 
> split into sub-pages. This is the only change suggested for this sub-task. 
> There is already a JIRA, https://issues.apache.org/jira/browse/ARROW-16802, to 
> improve the internal content, so it would be used for re-writing the content. 





[jira] [Resolved] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

2022-07-22 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace resolved ARROW-17115.
-
Resolution: Fixed

Issue resolved by pull request 13679
[https://github.com/apache/arrow/pull/13679]

> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
> --
>
> Key: ARROW-17115
> URL: https://issues.apache.org/jira/browse/ARROW-17115
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The new swiss join assumes that batches are being broken according to the 
> morsel/batch model and it assumes those batches have, at most, 32Ki rows 
> (signed 16-bit indices are used in various places).
> However, we are not currently slicing all of our inputs to batches this 
> small.  This is causing conbench to fail and would likely be a problem with 
> any large inputs.
> We should fix this by slicing batches in the engine to the appropriate 
> maximum size.
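The slicing the fix describes — breaking any oversized input into chunks no larger than the 32Ki-row limit — can be sketched as offset/length arithmetic (a sketch only; the constant comes from the 32Ki figure in the description above, not from Arrow source):

```python
MAX_BATCH_ROWS = 1 << 15  # 32Ki, the cap implied by signed 16-bit row indices

def batch_slices(num_rows, max_rows=MAX_BATCH_ROWS):
    # Yield (offset, length) pairs that cover num_rows rows in
    # chunks of at most max_rows rows each.
    offset = 0
    while offset < num_rows:
        length = min(max_rows, num_rows - offset)
        yield offset, length
        offset += length

print(list(batch_slices(70000)))
# [(0, 32768), (32768, 32768), (65536, 4464)]
```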





[jira] [Created] (ARROW-17189) [Python][Docs] Nightly build instructions install release version

2022-07-22 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-17189:
---

 Summary: [Python][Docs] Nightly build instructions install release 
version
 Key: ARROW-17189
 URL: https://issues.apache.org/jira/browse/ARROW-17189
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, Python
Affects Versions: 8.0.0
Reporter: Todd Farmer


The [Python installation 
documentation|https://arrow.apache.org/docs/python/install.html] provides the 
following instructions to install nightly builds of pyarrow:
{quote}{{Install the development version of PyArrow from 
[arrow-nightlies|https://anaconda.org/arrow-nightlies/pyarrow] conda channel:}}
{quote}
{quote}{{conda install -c arrow-nightlies pyarrow}}{quote}
The result of this seems to be installation of the release version, not a 
nightly build:
{code:java}
 (test-nightlies) todd@pop-os:~/arrow/docs$ python -c "import pyarrow; 
pyarrow.show_versions()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pyarrow'
(test-nightlies) todd@pop-os:~/arrow/docs$ conda install -c arrow-nightlies 
pyarrow
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/todd/miniconda3/envs/test-nightlies

  added / updated specs:
    - pyarrow

The following NEW packages will be INSTALLED:

  abseil-cpp         pkgs/main/linux-64::abseil-cpp-20211102.0-hd4dd3e8_0
  arrow-cpp          pkgs/main/linux-64::arrow-cpp-8.0.0-py310h3098874_0
  aws-c-common       pkgs/main/linux-64::aws-c-common-0.4.57-he6710b0_1
  aws-c-event-stream pkgs/main/linux-64::aws-c-event-stream-0.1.6-h2531618_5
  aws-checksums      pkgs/main/linux-64::aws-checksums-0.1.9-he6710b0_0
  aws-sdk-cpp        pkgs/main/linux-64::aws-sdk-cpp-1.8.185-hce553d0_0
  blas               pkgs/main/linux-64::blas-1.0-mkl
  boost-cpp          pkgs/main/linux-64::boost-cpp-1.73.0-h7f8727e_12
  brotli             pkgs/main/linux-64::brotli-1.0.9-he6710b0_2
  c-ares             pkgs/main/linux-64::c-ares-1.18.1-h7f8727e_0
  gflags             pkgs/main/linux-64::gflags-2.2.2-he6710b0_0
  glog               pkgs/main/linux-64::glog-0.5.0-h2531618_0
  grpc-cpp           pkgs/main/linux-64::grpc-cpp-1.46.1-h33aed49_0
  icu                pkgs/main/linux-64::icu-58.2-he6710b0_3
  intel-openmp       pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561
  krb5               pkgs/main/linux-64::krb5-1.19.2-hac12032_0
  libboost           pkgs/main/linux-64::libboost-1.73.0-h28710b8_12
  libcurl            pkgs/main/linux-64::libcurl-7.82.0-h0b77cf5_0
  libedit            pkgs/main/linux-64::libedit-3.1.20210910-h7f8727e_0
  libev              pkgs/main/linux-64::libev-4.33-h7f8727e_1
  libevent           pkgs/main/linux-64::libevent-2.1.12-h8f2d780_0
  libnghttp2         pkgs/main/linux-64::libnghttp2-1.46.0-hce63b2e_0
  libprotobuf        pkgs/main/linux-64::libprotobuf-3.20.1-h4ff587b_0
  libssh2            pkgs/main/linux-64::libssh2-1.10.0-h8f2d780_0
  libthrift          pkgs/main/linux-64::libthrift-0.15.0-hcc01f38_0
  lz4-c              pkgs/main/linux-64::lz4-c-1.9.3-h295c915_1
  mkl                pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640
  mkl-service        pkgs/main/linux-64::mkl-service-2.4.0-py310h7f8727e_0
  mkl_fft            pkgs/main/linux-64::mkl_fft-1.3.1-py310hd6ae3a3_0
  mkl_random         pkgs/main/linux-64::mkl_random-1.2.2-py310h00e6091_0
  numpy              pkgs/main/linux-64::numpy-1.22.3-py310hfa59a62_0
  numpy-base         pkgs/main/linux-64::numpy-base-1.22.3-py310h9585f30_0
  orc                pkgs/main/linux-64::orc-1.7.4-h07ed6aa_0
  pyarrow            pkgs/main/linux-64::pyarrow-8.0.0-py310h468efa6_0
  re2                pkgs/main/linux-64::re2-2022.04.01-h295c915_0
  snappy             pkgs/main/linux-64::snappy-1.1.9-h295c915_0
  utf8proc           pkgs/main/linux-64::utf8proc-2.6.1-h27cfd23_0
  zstd               pkgs/main/linux-64::zstd-1.5.2-ha4553b6_0
Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(test-nightlies) todd@pop-os:~/arrow/docs$ python -c "import pyarrow; 
pyarrow.show_versions()"
pyarrow version info

Package kind              : not indicated
Arrow C++ library version : 8.0.0   
Arrow C++ compiler        : GNU 11.2.0
Arrow C++ compiler flags  : -fvisibility-inlines-hidden -std=c++17 
-fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
-fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem 
/home/todd/miniconda3/envs/test-nightlies/include 
-fdebug-prefix-map=/opt/conda/conda-bld/arrow-cpp_1657131305338/work=/usr/local/src/conda/arrow-cpp-8.0.0
 
-fdebug-prefix-map=/home/todd/miniconda3/envs/test-nightlies=/usr/local/src/conda-prefix
 -fdiagnostics-color=always -O3 -DNDEBUG
Arrow C++ git revision    :         
Arrow C++ git description :         
Arrow C++ build ty

[jira] [Commented] (ARROW-17189) [Python][Docs] Nightly build instructions install release version

2022-07-22 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570204#comment-17570204
 ] 

Todd Farmer commented on ARROW-17189:
-

It's worth noting that the [instructions on 
anaconda.org|https://anaconda.org/arrow-nightlies/repo/installers?type=conda&label=main]
 differ, but are not successful:
{code:java}
(test-nightlies) todd@pop-os:~/arrow/docs$ conda install --channel 
"arrow-nightlies" package
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible 
solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible 
solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - package

Current channels:

  - https://conda.anaconda.org/arrow-nightlies/linux-64
  - https://conda.anaconda.org/arrow-nightlies/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.
(test-nightlies) todd@pop-os:~/arrow/docs$  {code}
It may be expected that experienced conda users will know how to work around 
this; that assumption does not apply to me. ;)

> [Python][Docs] Nightly build instructions install release version
> -
>
> Key: ARROW-17189
> URL: https://issues.apache.org/jira/browse/ARROW-17189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Affects Versions: 8.0.0
>Reporter: Todd Farmer
>Priority: Minor
>
> The [Python installation 
> documentation|https://arrow.apache.org/docs/python/install.html] provides the 
> following instructions to install nightly builds of pyarrow:
> {quote}{{Install the development version of PyArrow from 
> [arrow-nightlies|https://anaconda.org/arrow-nightlies/pyarrow] conda 
> channel:}}
> {quote}
> {quote}{{conda install -c arrow-nightlies pyarrow}}{quote}
> The result of this seems to be installation of the release version, not a 
> nightly build:
> {code:java}
>  (test-nightlies) todd@pop-os:~/arrow/docs$ python -c "import pyarrow; 
> pyarrow.show_versions()"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> ModuleNotFoundError: No module named 'pyarrow'
> (test-nightlies) todd@pop-os:~/arrow/docs$ conda install -c arrow-nightlies 
> pyarrow
> Collecting package metadata (current_repodata.json): done
> Solving environment: done
> 
> ## Package Plan ##
> 
>   environment location: /home/todd/miniconda3/envs/test-nightlies
> 
>   added / updated specs:
>     - pyarrow
> 
> The following NEW packages will be INSTALLED:
> 
>   abseil-cpp         pkgs/main/linux-64::abseil-cpp-20211102.0-hd4dd3e8_0
>   arrow-cpp          pkgs/main/linux-64::arrow-cpp-8.0.0-py310h3098874_0
>   aws-c-common       pkgs/main/linux-64::aws-c-common-0.4.57-he6710b0_1
>   aws-c-event-stream pkgs/main/linux-64::aws-c-event-stream-0.1.6-h2531618_5
>   aws-checksums      pkgs/main/linux-64::aws-checksums-0.1.9-he6710b0_0
>   aws-sdk-cpp        pkgs/main/linux-64::aws-sdk-cpp-1.8.185-hce553d0_0
>   blas               pkgs/main/linux-64::blas-1.0-mkl
>   boost-cpp          pkgs/main/linux-64::boost-cpp-1.73.0-h7f8727e_12
>   brotli             pkgs/main/linux-64::brotli-1.0.9-he6710b0_2
>   c-ares             pkgs/main/linux-64::c-ares-1.18.1-h7f8727e_0
>   gflags             pkgs/main/linux-64::gflags-2.2.2-he6710b0_0
>   glog               pkgs/main/linux-64::glog-0.5.0-h2531618_0
>   grpc-cpp           pkgs/main/linux-64::grpc-cpp-1.46.1-h33aed49_0
>   icu                pkgs/main/linux-64::icu-58.2-he6710b0_3
>   intel-openmp       pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561
>   krb5               pkgs/main/linux-64::krb5-1.19.2-hac12032_0
>   libboost           pkgs/main/linux-64::libboost-1.73.0-h28710b8_12
>   libcurl            pkgs/main/linux-64::libcurl-7.82.0-h0b77cf5_0
>   libedit            pkgs/main/linux-64::libedit-3.1.20210910-h7f8727e_0
>   libev              pkgs/main/linux-64::libev-4.33-h7f8727e_1
>   libevent           pkgs/main/linux-64::libevent-2.1.12-h8f2d780_0
>   libnghttp2         pkgs/main/linux-64::libnghttp2-1.46.0-hce63b2e_0
>   libprotobuf        pkgs/main/linux-64::libprotobuf-3.20.1-h4ff587b_0
>   libssh2            pkgs/main/linux-64::libssh2-1.10.0-h8f2d780_0
>   libthrift          pkgs/main/linux-64::libthrift-0.15.0-hcc01f38_0
>   lz4-c              pkgs/main/linux-64::lz4-c-1.9.3-h295c915_1
>   mkl                pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640
>   mkl-service        pkgs/mai

[jira] [Updated] (ARROW-16692) [C++] StackOverflow in merge generator causes segmentation fault in scan

2022-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16692:
---
Labels: pull-request-available  (was: )

> [C++] StackOverflow in merge generator causes segmentation fault in scan
> 
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
> Attachments: backtrace.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> Working on some example with the new | more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files | constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16692) [C++] StackOverflow in merge generator causes segmentation fault in scan

2022-07-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570214#comment-17570214
 ] 

Weston Pace commented on ARROW-16692:
-

So the trigger seems to be a large sequence of "empty" files.  Either the files 
are truly empty or (I think) it could be that a pushdown filter of some kind 
eliminated all of the rows in the file.  This seems to align with [~jonkeane]'s 
reproducer, especially the "One thing that might be important is: 
pickup_location_id is all NAs | nulls in the first 8 years of the data or so." 
part.

The merge generator logic roughly boils down to...

{code}
def get_next_batch():
  if current_file is None:
current_file = get_next_file()
  return get_next_batch_from_file(current_file)

def get_next_batch_from_file(file):
  batch = file.read_batch()
  if not batch:
current_file = None
return get_next_batch()
  return batch
{code}

The new code looks something like...

{code}
def get_next_batch():
  while True:
if current_file is None:
  current_file = get_next_file()
batch = get_next_batch_from_file(current_file)
if batch:
  return batch

def get_next_batch_from_file(file):
  batch = file.read_batch()
  if not batch:
current_file = None
return None
  return batch
{code}

However, because this is all async, the actual code change looks significantly 
messier.  Sometimes we call {{get_next_batch_from_file}} directly instead of 
going through {{get_next_batch}} and so we need a new flag to distinguish 
between the two cases:

{code}
def get_next_batch_from_file(file, recursive):
  batch = file.read_batch()
  if not batch:
current_file = None
if recursive:
  return None
return get_next_batch()
  return batch
{code}
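For illustration, the recursion-vs-loop difference can be reproduced with a self-contained Python sketch (the file/batch stand-ins below are hypothetical, not Acero's actual generator API): a long run of empty files overflows the stack in the recursive form but is absorbed by the loop form.

```python
import sys

def make_files(n_empty):
    # n_empty empty "files" followed by one file containing a single batch,
    # modelling a dataset whose leading files are emptied by pushdown filters.
    return iter([[] for _ in range(n_empty)] + [["batch"]])

def next_batch_recursive(files, state):
    # Old behaviour: one extra stack frame per exhausted file.
    if state.get("file") is None:
        state["file"] = iter(next(files))
    batch = next(state["file"], None)
    if batch is None:
        state["file"] = None
        return next_batch_recursive(files, state)
    return batch

def next_batch_loop(files, state):
    # Fixed behaviour: a loop absorbs any run of empty files with O(1) stack.
    while True:
        if state.get("file") is None:
            state["file"] = iter(next(files))
        batch = next(state["file"], None)
        if batch is not None:
            return batch
        state["file"] = None

sys.setrecursionlimit(200)  # make the "stack overflow" cheap to trigger

try:
    next_batch_recursive(make_files(1000), {})
    recursive_ok = True
except RecursionError:
    recursive_ok = False

loop_result = next_batch_loop(make_files(1000), {})
print(recursive_ok, loop_result)  # False batch
```

In the real async code the extra frames come from chained future callbacks rather than plain Python recursion, but the failure mode is the same.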


> [C++] StackOverflow in merge generator causes segmentation fault in scan
> 
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
> Attachments: backtrace.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> Working on some example with the new | more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files | constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

2022-07-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570217#comment-17570217
 ] 

Weston Pace commented on ARROW-17183:
-

{blockquote}
Specifically about this case, we assume that this Fetch and Sort operations are 
the most external relations in a Substrait plan. Meaning the Sort or Fetch 
operation is called at the end of the other operations. This is not a very 
accurate representation. First we need to understand if this is the general 
case. cc Weston Pace Jeroen van Straten 
{blockquote}

This is the case in TPC-H.  It also tends to be the case in real-world queries 
as it is rather tricky / impossible to write a mid-plan sort in SQL.  
Subqueries / window functions are the most likely place where you would see a 
mid-plan sort and we don't support either at the moment.

{blockquote}
Another approach is that we define a sink node which can execute a function 
which does the expected operation. In some of the defined Sink nodes (KSelect, 
Sort) there is a function called `DoFinish`. We should be able to call a custom 
function within this call. So from Substrait end when we extract the plan, then 
we can write the required `std::function` which would be an option for this 
custom sink node. And we assume a table as input and write the logic. This 
way we don't have to introduce new nodes. And what if there are different 
capabilities users need and ACERO has a limitation, can we always keep adding 
nodes to fulfil that? I am not so sure. This is just a high level thought 
process.
{blockquote}

This special sink node would have to collect all of the data in memory first.

{blockquote}
Although I have implemented a SortAndFetch node which can perform the fetch 
followed by a sort just by following what is being done in Sort and SelectK 
nodes. But I am not exactly sure any of these approaches are optimized or the 
best way to solve the problem.  
{blockquote}

The biggest general problem here is that a top-k node should not have to 
collect all data in memory.  It does have to scan all data but it should be 
able to throw away data that is obviously larger than K.  A SortAndFetch node 
should also not have to collect all data in memory.  Our current implementation 
does.  So what you've described is no worse than our current situation. Yet it 
is definitely something we should improve at some point.  There is ARROW-14202 
to improve the top-k node.  We could improve SortAndFetch at that time as well. 
 CC [~ArianaVillegas] as this might be something she wants to consider when 
addressing ARROW-14202 (i.e. we don't just need top-k we need top-k-skipping-m)
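For illustration, top-k-skipping-m can be kept to O(k + m) memory by retaining only the (m + k) best rows while streaming; a minimal Python sketch (hypothetical helper names, not Acero's implementation):

```python
import heapq
import itertools

def top_k_skipping_m(batches, k, m, key=None):
    # Keep only the (m + k) smallest rows while streaming all batches,
    # then drop the first m: "fetch k at offset m" over the sorted output.
    rows = itertools.chain.from_iterable(batches)
    best = heapq.nsmallest(m + k, rows, key=key)  # O(m + k) resident rows
    return best[m:]

batches = [[9, 1, 7], [4, 8, 2], [6, 3, 5]]
print(top_k_skipping_m(batches, k=3, m=2))  # [3, 4, 5]
```

`heapq.nsmallest` maintains a bounded heap while consuming the iterator, so all input is scanned but only m + k rows are ever held in memory.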

> [C++] Adding ExecNode with Sort and Fetch capability
> 
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability 
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset. 
> For instance pick 10 records skipping the first 5 elements. Here we can 
> define this as a Slice operation and records can be easily extracted in a 
> sink-node. 
> Sort and Fetch operation applies when we need to execute a Fetch operation on 
> sorted data. The main issue is we cannot have a sort node followed by a 
> fetch. The reason is that all existing node definitions supporting sort are 
> based on sink nodes. Since there cannot be a node followed by sink, this 
> functionality has to take place in a single node. 
> This is not a perfect solution for fetch and sort, but one way to do it is to 
> define a sink node where the records are sorted and then a set of items is 
> fetched. 
> Another dilemma is what if a sort is followed by a fetch. In that case, there 
> has to be a flag to control the order of the operations. 
> The objective of this ticket is to discuss a viable efficient solution and 
> include new nodes or a method to execute such a logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

2022-07-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570217#comment-17570217
 ] 

Weston Pace edited comment on ARROW-17183 at 7/23/22 12:23 AM:
---

{quote}
Specifically about this case, we assume that this Fetch and Sort operations are 
the most external relations in a Substrait plan. Meaning the Sort or Fetch 
operation is called at the end of the other operations. This is not a very 
accurate representation. First we need to understand if this is the general 
case. cc Weston Pace Jeroen van Straten 
{quote}

This is the case in TPC-H.  It also tends to be the case in real-world queries 
as it is rather tricky / impossible to write a mid-plan sort in SQL.  
Subqueries / window functions are the most likely place where you would see a 
mid-plan sort and we don't support either at the moment.

{quote}
Another approach is that we define a sink node which can execute a function 
which does the expected operation. In some of the defined Sink nodes (KSelect, 
Sort) there is a function called `DoFinish`. We should be able to call a custom 
function within this call. So from Substrait end when we extract the plan, then 
we can write the required `std::function` which would be an option for this 
custom sink node. And we assume a table as input and write the logic. This 
way we don't have to introduce new nodes. And what if there are different 
capabilities users need and ACERO has a limitation, can we always keep adding 
nodes to fulfil that? I am not so sure. This is just a high level thought 
process.
{quote}

This special sink node would have to collect all of the data in memory first.

{quote}
Although I have implemented a SortAndFetch node which can perform the fetch 
followed by a sort just by following what is being done in Sort and SelectK 
nodes. But I am not exactly sure any of these approaches are optimized or the 
best way to solve the problem.  
{quote}

The biggest general problem here is that a top-k node should not have to 
collect all data in memory.  It does have to scan all data but it should be 
able to throw away data that is obviously larger than K.  A SortAndFetch node 
should also not have to collect all data in memory.  Our current implementation 
does.  So what you've described is no worse than our current situation. Yet it 
is definitely something we should improve at some point.  There is ARROW-14202 
to improve the top-k node.  We could improve SortAndFetch at that time as well. 
 CC [~ArianaVillegas] as this might be something she wants to consider when 
addressing ARROW-14202 (i.e. we don't just need top-k we need top-k-skipping-m)


was (Author: westonpace):
{blockquote}
Specifically about this case, we assume that this Fetch and Sort operations are 
the most external relations in a Substrait plan. Meaning the Sort or Fetch 
operation is called at the end of the other operations. This is not a very 
accurate representation. First we need to understand if this is the general 
case. cc Weston Pace Jeroen van Straten 
{blockquote}

This is the case in TPC-H.  It also tends to be the case in real-world queries 
as it is rather tricky / impossible to write a mid-plan sort in SQL.  
Subqueries / window functions are the most likely place where you would see a 
mid-plan sort and we don't support either at the moment.

{blockquote}
Another approach is that we define a sink node which can execute a function 
which does the expected operation. In some of the defined Sink nodes (KSelect, 
Sort) there is a function called `DoFinish`. We should be able to call a custom 
function within this call. So from Substrait end when we extract the plan, then 
we can write the required `std::function` which would be an option for this 
custom sink node. And we assume a table as input and write the logic. This 
way we don't have to introduce new nodes. And what if there are different 
capabilities users need and ACERO has a limitation, can we always keep adding 
nodes to fulfil that? I am not so sure. This is just a high level thought 
process.
{blockquote}

This special sink node would have to collect all of the data in memory first.

{blockquote}
Although I have implemented a SortAndFetch node which can perform the fetch 
followed by a sort just by following what is being done in Sort and SelectK 
nodes. But I am not exactly sure any of these approaches are optimized or the 
best way to solve the problem.  
{blockquote}

The biggest general problem here is that a top-k node should not have to 
collect all data in memory.  It does have to scan all data but it should be 
able to throw away data that is obviously larger than K.  A SortAndFetch node 
should also not have to collect all data in memory.  Our current implementation 
does.  So what you've described is no worse than our current situation. Yet it 
is definitely something we should 

[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570221#comment-17570221
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--

This is a clear description of the short-term goals. I will go ahead and create 
two PRs, for the Fetch and SortAndFetch nodes. 

> [C++] Adding ExecNode with Sort and Fetch capability
> 
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability 
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset. 
> For instance pick 10 records skipping the first 5 elements. Here we can 
> define this as a Slice operation and records can be easily extracted in a 
> sink-node. 
> Sort and Fetch operation applies when we need to execute a Fetch operation on 
> sorted data. The main issue is we cannot have a sort node followed by a 
> fetch. The reason is that all existing node definitions supporting sort are 
> based on sink nodes. Since there cannot be a node followed by sink, this 
> functionality has to take place in a single node. 
> This is not a perfect solution for fetch and sort, but one way to do it is to 
> define a sink node where the records are sorted and then a set of items is 
> fetched. 
> Another dilemma is what if a sort is followed by a fetch. In that case, there 
> has to be a flag to control the order of the operations. 
> The objective of this ticket is to discuss a viable efficient solution and 
> include new nodes or a method to execute such a logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17190) [C++] Create a Fetch Node

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)
Vibhatha Lakmal Abeykoon created ARROW-17190:


 Summary: [C++] Create a Fetch Node
 Key: ARROW-17190
 URL: https://issues.apache.org/jira/browse/ARROW-17190
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Vibhatha Lakmal Abeykoon
Assignee: Vibhatha Lakmal Abeykoon


The goal of the fetch node is to skip some records and select the given amount 
of records. 

This will be implemented as discussed in the parent issue. 
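The intended semantics (skip an offset, then take a count, streaming across batch boundaries) can be sketched in plain Python; the names below are illustrative, not the ExecNode API:

```python
import itertools

def fetch(batches, offset, count):
    # Skip `offset` rows, then emit up to `count` rows, streaming across
    # batch boundaries without materializing the full input.
    rows = itertools.chain.from_iterable(batches)
    return list(itertools.islice(rows, offset, offset + count))

batches = [[0, 1, 2], [3, 4], [5, 6, 7]]
print(fetch(batches, offset=5, count=10))  # [5, 6, 7]
print(fetch(batches, offset=2, count=3))   # [2, 3, 4]
```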



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570222#comment-17570222
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--

Maybe just a fetch node with sort capability that can be left optional.

> [C++] Adding ExecNode with Sort and Fetch capability
> 
>
> Key: ARROW-17183
> URL: https://issues.apache.org/jira/browse/ARROW-17183
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability 
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset. 
> For instance pick 10 records skipping the first 5 elements. Here we can 
> define this as a Slice operation and records can be easily extracted in a 
> sink-node. 
> Sort and Fetch operation applies when we need to execute a Fetch operation on 
> sorted data. The main issue is we cannot have a sort node followed by a 
> fetch. The reason is that all existing node definitions supporting sort are 
> based on sink nodes. Since there cannot be a node followed by sink, this 
> functionality has to take place in a single node. 
> This is not a perfect solution for fetch and sort, but one way to do it is to 
> define a sink node where the records are sorted and then a set of items is 
> fetched. 
> Another dilemma is what if a sort is followed by a fetch. In that case, there 
> has to be a flag to control the order of the operations. 
> The objective of this ticket is to discuss a viable efficient solution and 
> include new nodes or a method to execute such a logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17190) [C++] Create a Fetch Node

2022-07-22 Thread Vibhatha Lakmal Abeykoon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570223#comment-17570223
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-17190:
--

In addition, it would be better to include sorting as an optional feature. This 
would avoid adding another node just to handle the sort-and-fetch operation. 
Here we can also provide a flag to decide the order of operations: 
fetch-then-sort or sort-then-fetch. The node name could be just `Fetch`. 
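Since the two orders generally return different rows, the flag is not cosmetic; a minimal illustration in plain Python (hypothetical `fetch` helper, not the Acero API):

```python
def fetch(rows, offset, count):
    # Illustrative fetch: skip `offset` rows, take `count`.
    return rows[offset:offset + count]

data = [5, 1, 4, 2, 3]

# sort-then-fetch: establish the global order first, then slice it
sort_then_fetch = fetch(sorted(data), offset=1, count=2)

# fetch-then-sort: slice in arrival order, then sort only the slice
fetch_then_sort = sorted(fetch(data, offset=1, count=2))

print(sort_then_fetch, fetch_then_sort)  # [2, 3] [1, 4]
```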

> [C++] Create a Fetch Node
> -
>
> Key: ARROW-17190
> URL: https://issues.apache.org/jira/browse/ARROW-17190
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> The goal of the fetch node is to skip some records and select the given 
> amount of records. 
> This will be implemented as discussed in the parent issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)