[jira] [Updated] (ARROW-18337) [R] Possible undesirable handling of POSIXlt objects

2023-01-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18337:
---
Labels: pull-request-available  (was: )

> [R] Possible undesirable handling of POSIXlt objects
> 
>
> Key: ARROW-18337
> URL: https://issues.apache.org/jira/browse/ARROW-18337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Danielle Navarro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the course of updating documentation, I noticed that it is possible to 
> create an Arrow array of POSIXlt objects from R, but not a scalar. 
> https://github.com/apache/arrow/pull/14514#discussion_r1016078081
> This works:
> {code:r}
> tm <- as.POSIXlt(c(Sys.time(), Sys.time()))
> arrow::Array$create(tm)
> {code}
> This fails:
> {code:r}
> arrow::Scalar$create(as.POSIXlt(Sys.time()))
> {code}
> It's possible to manually convert a POSIXlt object to a struct scalar like 
> this:
> {code:r}
> df <- as.data.frame(unclass(as.POSIXlt(Sys.time())))
> arrow::Scalar$create(df, 
>   type = struct(
> sec = float32(), 
> min = int32(),
> hour = int32(),
> mday = int32(),
> mon = int32(),
> year = int32(),
> wday = int32(),
> yday = int32(),
> isdst = int32(),
> zone = utf8(),
> gmtoff = int32()
>   ))
> {code}
> although this does not seem precisely the same as the behaviour of 
> Array$create() which creates an extension type?
> It was unclear to us ([~thisisnic] and myself) whether the current behaviour 
> was desirable, so it seemed sensible to open an issue! 
> Related issue:
> https://issues.apache.org/jira/browse/ARROW-18263



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16480) [R] Update read_csv_arrow and open_dataset parse_options, read_options, and convert_options to take lists

2023-01-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16480:
---
Labels: good-first-issue good-second-issue pull-request-available  (was: 
good-first-issue good-second-issue)

> [R] Update read_csv_arrow and open_dataset parse_options, read_options, and 
> convert_options to take lists
> -
>
> Key: ARROW-16480
> URL: https://issues.apache.org/jira/browse/ARROW-16480
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: good-first-issue, good-second-issue, 
> pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> From a discussion on a PR which documents the encoding argument 
> (https://github.com/apache/arrow/pull/13038)
> Currently if we want to specify Arrow-specific read options such as encoding, 
> we'd have to do something like this:
> {code:r}
> df <- read_csv_arrow(tf, read_options = CsvReadOptions$create(encoding = 
> "utf8")) {code}
> However, this uses a lower-level API that we don't want to include in the 
> examples for end-users to see.
>  
> We should update the code inside {{read_csv_arrow()}} so that the user can 
> specify {{read_options}} as a list which we then pass through to 
> CsvReadOptions internally, so we could instead call the much more 
> user-friendly code below:
> {code:r}
> df <- read_csv_arrow(tf, read_options = list(encoding = "utf8")) {code}
> We should then add an example of this to the function doc examples.
>  
> We also should do the same for parse_options and convert_options.
> Similarly, we can do:
> {code:r}
> open_dataset("data.csv", format = "csv", convert_options = 
> CsvConvertOptions$create(null_values = "Not Range", strings_can_be_null = 
> TRUE)) %>% collect()
> {code}
> but it'd be great to be able to do:
> {code:r}
> open_dataset("data.csv", format = "csv", convert_options = list(null_values = 
> "Not Range", strings_can_be_null = TRUE)) %>% collect()
> {code}
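The change proposed above is to the R package, but the mapping itself is language-agnostic: accept either the lower-level options object or a plain list of key-value pairs, and construct the options object internally. A minimal sketch in Python, with a stand-in class and hypothetical field names in place of the real CsvReadOptions:

```python
# Sketch: translate a user-supplied dict of options into a typed options
# object, as read_csv_arrow() might do internally. The class and its
# field names are stand-ins, not the actual Arrow API.

class CsvReadOptions:
    """Stand-in for the lower-level options class."""
    _fields = {"encoding": "utf8", "skip_rows": 0, "block_size": 1 << 20}

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self._fields)
        if unknown:
            raise ValueError(f"unknown read options: {sorted(unknown)}")
        for name, default in self._fields.items():
            setattr(self, name, kwargs.get(name, default))

def normalize_read_options(read_options):
    """Accept either an options object or a plain dict of options."""
    if isinstance(read_options, CsvReadOptions):
        return read_options                      # lower-level API still works
    return CsvReadOptions(**dict(read_options))  # new user-friendly path

opts = normalize_read_options({"encoding": "latin1"})
print(opts.encoding)   # latin1
print(opts.skip_rows)  # 0
```

Validating against a known field list also gives users an early error for misspelled option names, which the pass-through list would otherwise silently hide.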





[jira] [Updated] (ARROW-18377) MIGRATION: Automate component labels from issue form content

2023-01-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18377:
---
Labels: gh-migration pull-request-available  (was: gh-migration)

> MIGRATION: Automate component labels from issue form content
> 
>
> Key: ARROW-18377
> URL: https://issues.apache.org/jira/browse/ARROW-18377
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Jacob Wujciak
>Priority: Major
>  Labels: gh-migration, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-18364 added the ability to report issues in GitHub, and includes GitHub 
> issue templates with a drop-down component(s) selector. These form elements 
> drive resulting issue markdown only, and cannot dynamically drive issue 
> labels. This requires GitHub Actions, which also have a few limitations. 
> First, the issue form does not produce any structured data; it only produces 
> the issue description markdown, so a parser is required. Second, ASF 
> restricts GitHub Actions to a selection of approved actions. It is likely 
> that while community actions exist to generate structured data from issue 
> forms, the Apache Arrow project will need to write its own parser and label 
> application action.
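GitHub issue forms render each field as a Markdown heading followed by the answer, so the parser mostly needs to locate the component field's heading and split its answer. A sketch of such a parser in Python (the heading text "Component(s)" is an assumption about the template, not a confirmed detail):

```python
import re

def extract_components(issue_body):
    """Pull the selected component(s) out of issue-form markdown.

    GitHub issue forms render each field as an H3 heading followed by the
    answer, e.g. '### Component(s)\n\nPython, R'. The exact heading text
    used here is an assumption, not the Arrow template's actual wording.
    """
    match = re.search(
        r"^### Component\(s\)\s*\n+(.+?)(?:\n###|\Z)",
        issue_body,
        flags=re.MULTILINE | re.DOTALL,
    )
    if not match:
        return []
    return [c.strip() for c in match.group(1).split(",") if c.strip()]

body = "### Describe the bug\n\nIt crashes.\n\n### Component(s)\n\nPython, R\n"
print(extract_components(body))  # ['Python', 'R']
```

A label-application action would then map each extracted component to the corresponding `Component: X` label and apply it via the GitHub API.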





[jira] [Updated] (ARROW-16795) [C#][Flight] Nightly verify-rc-source-csharp-macos-arm64 fails

2023-01-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16795:
---
Labels: pull-request-available  (was: )

> [C#][Flight] Nightly verify-rc-source-csharp-macos-arm64 fails
> --
>
> Key: ARROW-16795
> URL: https://issues.apache.org/jira/browse/ARROW-16795
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, FlightRPC
>Affects Versions: 9.0.0
>Reporter: Raúl Cumplido
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The "verify-rc-source-csharp-macos-arm64" job has been failing on and off
> since ~May 18th; the issue seems to be with the Flight tests.
> {code:java}
>  Failed Apache.Arrow.Flight.Tests.FlightTests.TestGetFlightMetadata [567 ms]
>   Error Message:
>    Grpc.Core.RpcException : Status(StatusCode="Internal", Detail="Error 
> starting gRPC call. HttpRequestException: An HTTP/2 connection could not be 
> established because the server did not complete the HTTP/2 handshake.", 
> DebugException="System.Net.Http.HttpRequestException: An HTTP/2 connection 
> could not be established because the server did not complete the HTTP/2 
> handshake.
>    at 
> System.Net.Http.HttpConnectionPool.ReturnHttp2Connection(Http2Connection 
> connection, Boolean isNewConnection)
>    at 
> System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage 
> request)
>    at 
> System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object
>  s)
>    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread 
> threadPoolThread, ExecutionContext executionContext, ContextCallback 
> callback, Object state)
>    at 
> System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread
>  threadPoolThread)
>    at 
> System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecuteFromThreadPool(Thread
>  threadPoolThread)
>    at System.Threading.ThreadPoolWorkQueue.Dispatch()
>    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
>    at System.Threading.Thread.StartCallback()
> --- End of stack trace from previous location ---
>    at 
> System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken
>  cancellationToken)
>    at 
> System.Net.Http.HttpConnectionPool.GetHttp2ConnectionAsync(HttpRequestMessage 
> request, Boolean async, CancellationToken cancellationToken)
>    at 
> System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage
>  request, Boolean async, Boolean doRequestAuth, CancellationToken 
> cancellationToken)
>    at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, 
> Boolean async, CancellationToken cancellationToken)
>    at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, 
> Nullable`1 timeout)")
>   Stack Trace:
>      at Grpc.Net.Client.Internal.GrpcCall`2.GetResponseHeadersCoreAsync()
>    at 
> Apache.Arrow.Flight.Client.FlightClient.<>c.d.MoveNext() in 
> /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/csharp/src/Apache.Arrow.Flight/Client/FlightClient.cs:line
>  71
> --- End of stack trace from previous location ---
>    at Apache.Arrow.Flight.Tests.FlightTests.TestGetFlightMetadata() in 
> /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/csharp/test/Apache.Arrow.Flight.Tests/FlightTests.cs:line
>  183
> --- End of stack trace from previous location ---
>   Failed Apache.Arrow.Flight.Tests.FlightTests.TestGetSchema [108 ms]
>   Error Message:
>    Grpc.Core.RpcException : Status(StatusCode="Internal", Detail="Error 
> starting gRPC call. HttpRequestException: An HTTP/2 connection could not be 
> established because the server did not complete the HTTP/2 handshake.", 
> DebugException="System.Net.Http.HttpRequestException: An HTTP/2 connection 
> could not be established because the server did not complete the HTTP/2 
> handshake.
>    at 
> System.Net.Http.HttpConnectionPool.ReturnHttp2Connection(Http2Connection 
> connection, Boolean isNewConnection)
>    at 
> System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage 
> request)
>    at 
> System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine&
>  stateMachine)
>    at 
> System.Net.Http.HttpConnectionPool.AddHttp2ConnectionAsync(HttpRequestMessage 
> request)
>    at 
> System.Net.Http.HttpConnectionPool.<>c__DisplayClass78_0.b__0()
>    at System.Threading.Tasks.Task`1.InnerInvoke()
>    at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
>    at 

[jira] [Updated] (ARROW-18161) [Ruby] Tables can have buffers get GC'ed

2023-01-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18161:
---
Labels: pull-request-available  (was: )

> [Ruby] Tables can have buffers get GC'ed
> 
>
> Key: ARROW-18161
> URL: https://issues.apache.org/jira/browse/ARROW-18161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 9.0.0
> Environment: Ruby 3.1.2
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Given an Arrow::Table with several columns, "x":
>  
> {code:ruby}
> # Rails console outputs
> 3.1.2 :107 > x.schema
>  => 
> #<Arrow::Schema:0x... ptr=0x... dates: date32[day]
> expected_values: double>                       
> 3.1.2 :108 > x.schema
>  => 
> #<Arrow::Schema:0x... ptr=0x... dates: date32[day]
> expected_values: double>                       
> 3.1.2 :109 >  {code}
> Note that the object and pointer have both changed values.
> But the far bigger issue is that repeated reads from it will cause different 
> results:
> {code:ruby}
> 3.1.2 :097 > x[1][0]
>  => Sun, 22 Aug 2021 
> 3.1.2 :098 > x[1][1]
>  => nil 
> 3.1.2 :099 > x[1][0]
>  => nil {code}
> I have a lot of issues like this - when I have done these types of read 
> operations, I get the original table with the data in the columns all 
> shuffled around or deleted. 
> I do ingest the data slightly oddly in the first place, as it comes in over 
> gRPC: I use an Arrow::Buffer to read it from the gRPC stream and then pass 
> that into Arrow::Table.load. But I would not expect that once it was in an 
> Arrow::Table I could do anything to permute it unintentionally.
>  
>  





[jira] [Updated] (ARROW-11631) [R] Implement RPrimitiveConverter for Decimal type

2023-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11631:
---
Labels: pull-request-available  (was: )

> [R] Implement RPrimitiveConverter for Decimal type
> --
>
> Key: ARROW-11631
> URL: https://issues.apache.org/jira/browse/ARROW-11631
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This succeeds:
> {code:r}
> Array$create(1)$cast(decimal(10, 2)){code}
> but this fails:
> {code:r}
> Array$create(1, type = decimal(10, 2)){code}
> with error:
> {code}
> NotImplemented: Extend{code}
> because the {{Extend}} method of the {{RPrimitiveConverter}} class for the 
> Decimal type is not yet implemented.
> The error is thrown here: 
> [https://github.com/apache/arrow/blob/7184c3f46981dd52c3c521b2676796e82f17da77/r/src/r_to_arrow.cpp#L601]





[jira] [Updated] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2023-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18400:
---
Labels: pull-request-available  (was: )

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem; it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas, but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.
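The reported numbers can be turned into an implied growth exponent: for each doubling of the row count, log2 of the memory ratio is ~1 for linear scaling and ~2 for quadratic. A short sketch over the 10.0.1 column of the table above:

```python
import math

# Memory figures (MB) for pyarrow 10.0.1, taken from the table above.
mem_10_0_1 = {
    32_000: 362,
    64_000: 531,
    128_000: 1_152,
    256_000: 2_888,
    512_000: 10_301,
    1_024_000: 38_697,
}

# For each doubling of the row count, the implied growth exponent is
# log2(mem[2n] / mem[n]): ~1 means linear scaling, ~2 means quadratic.
rows = sorted(mem_10_0_1)
exponents = [
    math.log2(mem_10_0_1[b] / mem_10_0_1[a]) for a, b in zip(rows, rows[1:])
]
for (a, b), e in zip(zip(rows, rows[1:]), exponents):
    print(f"{a:>9,} -> {b:>9,} rows: growth exponent ~ {e:.2f}")
```

The last two doublings come out close to 2, consistent with the quadratic behaviour described, while the early doublings are closer to linear.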





[jira] [Updated] (ARROW-18086) [Ruby] Importing table containing float16 array throws error

2023-01-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18086:
---
Labels: pull-request-available  (was: )

> [Ruby] Importing table containing float16 array throws error
> 
>
> Key: ARROW-18086
> URL: https://issues.apache.org/jira/browse/ARROW-18086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 9.0.0
>Reporter: Atte Keinänen
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In Red Arrow, loading a table containing a float16 array leads to this error 
> when using the IPC streaming format:
> {code:ruby}
> > Arrow::Table.load(Arrow::Buffer.new(resp.body), format: :arrow_streaming)
> cannot create instance of abstract (non-instantiatable) type 'GArrowDataType' 
> from 
> /usr/local/bundle/gems/gobject-introspection-4.0.3/lib/gobject-introspection/loader.rb:688:in
>  `invoke' from 
> /usr/local/bundle/gems/gobject-introspection-4.0.3/lib/gobject-introspection/loader.rb:559:in
>  `get_field'{code}
> At least with a float64 list array this does not happen.





[jira] [Updated] (ARROW-11402) [C++][Dataset] Allow more aggressive implicit casts for literals

2023-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11402:
---
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Allow more aggressive implicit casts for literals
> ---
>
> Key: ARROW-11402
> URL: https://issues.apache.org/jira/browse/ARROW-11402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After ARROW-8919, a literal in an Expression may cause unnecessary implicit 
> casting of a column. For example {{equal(field_ref("i8"), literal(1))}} will 
> cause column i8 to be promoted to the type of the literal (int32) for 
> comparison. Since we have access to the literal value at bind time we could 
> examine {{1}} and determine that it can safely be *de*moted to int8, which 
> produces a semantically equivalent and more performant filter.
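The bind-time check described above amounts to asking whether the literal's value is exactly representable in the column's narrower type. A sketch in plain Python (the function and type-name table are illustrative, not the Arrow C++ API):

```python
# Sketch: decide at bind time whether an integer literal can be demoted
# to the column's narrower type, so the column need not be promoted.
# Ranges follow Arrow's signed integer types; the API is hypothetical.

INT_RANGES = {
    "int8":  (-2**7,  2**7 - 1),
    "int16": (-2**15, 2**15 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
}

def can_demote(value, target_type):
    """True if `value` is exactly representable in `target_type`."""
    lo, hi = INT_RANGES[target_type]
    return lo <= value <= hi

# equal(field_ref("i8"), literal(1)): 1 fits in int8, so compare as int8.
print(can_demote(1, "int8"))      # True
# literal(1000) does not fit in int8, so the column must be promoted.
print(can_demote(1000, "int8"))   # False
```

When the literal does not fit at all (e.g. comparing an int8 column against 1000), the filter can even be folded to a constant instead of casting the column.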





[jira] [Updated] (ARROW-17302) [R] Configure curl timeout policy for S3

2023-01-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17302:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [R] Configure curl timeout policy for S3
> 
>
> Key: ARROW-17302
> URL: https://issues.apache.org/jira/browse/ARROW-17302
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See ARROW-16521





[jira] [Updated] (ARROW-15735) [C++] Hash aggregate functions to return first and last value from a group.

2023-01-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15735:
---
Labels: kernel pull-request-available  (was: kernel)

> [C++] Hash aggregate functions to return first and last value from a group.
> ---
>
> Key: ARROW-15735
> URL: https://issues.apache.org/jira/browse/ARROW-15735
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: A. Coady
>Assignee: Sanjiban Sengupta
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Follow-up to ARROW-13993, which implemented `hash_one` to select an arbitrary 
> value, as the core engine lacks support for ordering. I think `first` and 
> `last` will still be in demand though, based on pandas and SQL usage.
> It could be done without core changes by using `min_max` on an array of 
> indices. For that reason, maybe it would be better as 
> `hash_\{first,last}_index`, suitable for use with `take`.
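The index-based workaround sketched above reduces to: compute the min and max row index per group, then `take` the values at those indices. Plain Python standing in for the compute kernels:

```python
# Sketch of the workaround described above: emulate hash_first/hash_last
# by taking min/max over row indices per group, then gathering values at
# those indices with a take step. Plain Python stands in for the kernels.

def group_first_last(keys, values):
    """Return {key: (first_value, last_value)} using index min/max."""
    indices = {}  # key -> [min_index, max_index]
    for i, k in enumerate(keys):
        if k not in indices:
            indices[k] = [i, i]   # first sighting fixes the min index
        else:
            indices[k][1] = i     # later sightings advance the max index
    # "take" the values at those indices
    return {k: (values[lo], values[hi]) for k, (lo, hi) in indices.items()}

keys = ["a", "b", "a", "b", "a"]
vals = [10, 20, 30, 40, 50]
print(group_first_last(keys, vals))  # {'a': (10, 50), 'b': (20, 40)}
```

Note this only gives the order-of-appearance first/last, which is exactly why it works without the engine tracking any explicit ordering.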





[jira] [Updated] (ARROW-18318) [Python] Expose Scalar.validate

2023-01-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18318:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [Python] Expose Scalar.validate
> ---
>
> Key: ARROW-18318
> URL: https://issues.apache.org/jira/browse/ARROW-18318
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In C++, scalars have {{Validate}} and {{ValidateFull}} methods, just like 
> arrays. However, these methods were not exposed on PyArrow scalars (while 
> they are on PyArrow arrays).





[jira] [Updated] (ARROW-15206) [Ruby] Allow to pass schema when loading table from file

2023-01-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15206:
---
Labels: pull-request-available  (was: )

> [Ruby] Allow to pass schema when loading table from file
> 
>
> Key: ARROW-15206
> URL: https://issues.apache.org/jira/browse/ARROW-15206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kanstantsin Ilchanka
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is the ability to do this in C++, but not in Ruby:
> {code:ruby}
> schema = Arrow::Schema.new(a: :int64, b: :double)
> Arrow::Table.load(URI('file:///tmp/example.csv'), format: :csv, schema: 
> schema){code}
> This should also work when loading multiple files from a folder.





[jira] [Updated] (ARROW-18202) [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's replace_string_regex kernel since 10.0.0

2022-12-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18202:
---
Labels: pull-request-available  (was: )

> [R][C++] Different behaviour of R's base::gsub() binding aka libarrow's 
> replace_string_regex kernel since 10.0.0
> 
>
> Key: ARROW-18202
> URL: https://issues.apache.org/jira/browse/ARROW-18202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lorenzo Isella
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hello,
> I think there is a problem with arrow 10.0 and R. I did not have this issue 
> with arrow 9.0.
> Could you please have a look?
> Many thanks
>  
> {code:r}
> library(tidyverse)
> library(arrow)
> ll <- c(      "100",   "1000",  "200"  , "3000" , "50"   ,
>         "500", ""   ,   "Not Range")
> df <- tibble(x=rep(ll, 1000), y=seq(8000))
> write_tsv(df, "data.tsv")
> data <- open_dataset("data.tsv", format="tsv",
>                      skip_rows=1,
>                      schema=schema(x=string(),
>                      y=double())
> )
> test <- data |>
>     collect()
> ###I want to replace the "" with "0". I believe this worked with arrow 9.0
> df2 <- data |>
>     mutate(x=gsub("^$","0",x) ) |>
>     collect()
> df2 ### now I did not modify the  "" entries in x
> #> # A tibble: 8,000 × 2
> #>    x               y
> #>    <chr>       <dbl>
> #>  1 "100"       1
> #>  2 "1000"      2
> #>  3 "200"       3
> #>  4 "3000"      4
> #>  5 "50"        5
> #>  6 "500"       6
> #>  7 ""              7
> #>  8 "Not Range"     8
> #>  9 "100"       9
> #> 10 "1000"     10
> #> # … with 7,990 more rows
>  
> df3 <- df |>
>     mutate(x=gsub("^$","0",x) )
> df3  ## and this is fine
> #> # A tibble: 8,000 × 2
> #>    x             y
> #>    <chr>     <dbl>
> #>  1 100       1
> #>  2 1000      2
> #>  3 200       3
> #>  4 3000      4
> #>  5 50        5
> #>  6 500       6
> #>  7 0             7
> #>  8 Not Range     8
> #>  9 100       9
> #> 10 1000     10
> #> # … with 7,990 more rows
> ## How to fix this...I believe this issue did not arise with arrow 9.0.
> sessionInfo()
> #> R version 4.2.1 (2022-06-23)
> #> Platform: x86_64-pc-linux-gnu (64-bit)
> #> Running under: Debian GNU/Linux 11 (bullseye)
> #> 
> #> Matrix products: default
> #> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> #> 
> #> locale:
> #>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> #>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> #>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> #>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
> #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> #> 
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base     
> #> 
> #> other attached packages:
> #>  [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
> #>  [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8   
> #>  [9] ggplot2_3.3.6   tidyverse_1.3.2
> #> 
> #> loaded via a namespace (and not attached):
> #>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30      
> #>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0   
> #>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17      
> #> [10] httr_1.4.4          highr_0.9           pillar_1.8.1       
> #> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1       
> #> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17     
> #> [19] styler_1.8.0        googledrive_2.0.0   bit_4.0.4          
> #> [22] munsell_0.5.0       broom_1.0.1         compiler_4.2.1     
> #> [25] modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3    
> #> [28] htmltools_0.5.3     tidyselect_1.2.0    fansi_1.0.3        
> #> [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1       
> #> [34] withr_2.5.0         R.methodsS3_1.8.2   grid_4.2.1         
> #> [37] jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3    
> #> [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
> #> [43] vroom_1.6.0         cli_3.4.1           stringi_1.7.8      
> #> [46] fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2     
> #> [49] generics_0.1.3      vctrs_0.5.0         tools_4.2.1        
> #> [52] bit64_4.0.5         R.cache_0.16.0      glue_1.6.2         
> #> [55] hms_1.1.2          

[jira] [Updated] (ARROW-18195) [R][C++] Final value returned by case_when is NA when input has 64 or more values and 1 or more NAs

2022-12-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18195:
---
Labels: pull-request-available  (was: )

> [R][C++] Final value returned by case_when is NA when input has 64 or more 
> values and 1 or more NAs
> ---
>
> Key: ARROW-18195
> URL: https://issues.apache.org/jira/browse/ARROW-18195
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 10.0.0
>Reporter: Lee Mendelowitz
>Assignee: Will Jones
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_issue.R
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There appears to be a bug when processing an Arrow table with NA values and 
> using `dplyr::case_when`. A reproducible example is below: the output from 
> arrow table processing does not match the output when processing a tibble. If 
> the NA's are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> library(assertthat)
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
> # Create a test dataframe with NA values
> test_df = tibble(
> play_result = sample(play_results, nrows, replace = TRUE)
> ) %>%
> mutate(
> play_result = ifelse(runif(nrows) < frac_na, NA_character_, 
> play_result)
> )
> 
> test_arrow = arrow_table(test_df)
> process_plays = function(df) {
> df %>%
> mutate(
> avg = case_when(
> play_result == 'single' ~ 1,
> play_result == 'double' ~ 1,
> play_result == 'triple' ~ 1,
> play_result == 'home_run' ~ 1,
> is.na(play_result) ~ NA_real_,
> TRUE ~ 0
> )
> ) %>%
> count(play_result, avg) %>%
> arrange(play_result)
> }
> # Compare arrow_table result to tibble result
> result_tibble = process_plays(test_df)
> result_arrow = process_plays(test_arrow) %>% collect()
> assertthat::assert_that(identical(result_tibble, result_arrow))
> #> Error: result_tibble not identical to result_arrow
> ```
> Created on 2022-10-29 with [reprex 
> v2.0.2](https://reprex.tidyverse.org)
> {noformat}
> I have reproduced this issue both on Mac OS and Ubuntu 20.04.
>  
> {noformat}
> ```
> r$> sessionInfo()
> R version 4.2.1 (2022-06-23)
> Platform: aarch64-apple-darwin21.5.0 (64-bit)
> Running under: macOS Monterey 12.5.1
> Matrix products: default
> BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
> LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> other attached packages:
> [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
> loaded via a namespace (and not attached):
>  [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2 
> R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
>  [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0    
> pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
> [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31         
> fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5       
> tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
> [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0       
> clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
> [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
> ```
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12264:
---
Labels: pull-request-available  (was: )

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Assignee: Sanjiban Sengupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.
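A minimal Python sketch (illustrative only, using hypothetical names rather than the Acero expression API) of why the current min/max filter wrongly prunes row groups containing NaN, and how a NaN-tolerant comparison would behave:

```python
import math

# Row-group statistics as read from the Parquet column chunk metadata.
stat_min, stat_max = 0.0, 10.0

def may_contain(value):
    # Mirrors and_(greater_equal(field, min), less_equal(field, max)):
    # every comparison with NaN is False, so NaN rows are wrongly pruned.
    return stat_min <= value <= stat_max

def may_contain_nan_aware(value):
    # A "greater_equal_or_nan"-style check: treat NaN as a possible match,
    # per the parquet.thrift rule "when looking for NaN values, min and
    # max should be ignored".
    return math.isnan(value) or may_contain(value)
```

Here `may_contain(math.nan)` is False (the row group would be skipped even though it may hold NaN values), while `may_contain_nan_aware(math.nan)` is True.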





[jira] [Updated] (ARROW-16212) [C++][Python] Register Multiple Kernels for a UDF

2022-12-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16212:
---
Labels: pull-request-available  (was: )

> [C++][Python] Register Multiple Kernels for a UDF
> -
>
> Key: ARROW-16212
> URL: https://issues.apache.org/jira/browse/ARROW-16212
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current UDF integration 
> (https://issues.apache.org/jira/browse/ARROW-15639) doesn't support multiple 
> kernel registration. It only supports registering a single kernel under a 
> given function name. Enabling multiple kernels to be registered under the 
> same function name is a must for usability and consistency with the existing 
> functions in the function registry. 
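To illustrate the requested behaviour, here is a hedged Python sketch (not the actual pyarrow registration API) of a registry that keeps multiple kernels under one function name and dispatches on the input-type signature:

```python
# Hypothetical sketch: one function name, several kernels keyed by the
# tuple of argument types; lookup raises if no kernel matches.
class FunctionRegistry:
    def __init__(self):
        self._kernels = {}  # name -> {input-type signature: callable}

    def register(self, name, in_types, func):
        self._kernels.setdefault(name, {})[in_types] = func

    def call(self, name, *args):
        sig = tuple(type(a) for a in args)
        kernels = self._kernels.get(name, {})
        if sig not in kernels:
            raise NotImplementedError(f"no kernel for {name} with {sig}")
        return kernels[sig](*args)

registry = FunctionRegistry()
# Two kernels registered under the same function name, one per input type.
registry.register("add_one", (int,), lambda x: x + 1)
registry.register("add_one", (float,), lambda x: x + 1.0)
```

With this shape, `registry.call("add_one", 2)` and `registry.call("add_one", 2.5)` each reach their own kernel, which is the consistency with built-in compute functions the issue asks for.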





[jira] [Updated] (ARROW-18403) [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported"

2022-12-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18403:
---
Labels: pull-request-available substrait  (was: substrait)

> [C++] Error consuming Substrait plan which uses count function: "only unary 
> aggregate functions are currently supported"
> 
>
> Key: ARROW-18403
> URL: https://issues.apache.org/jira/browse/ARROW-18403
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available, substrait
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-17523 added support for the Substrait extension function "count", but 
> when I write code which produces a Substrait plan which calls it, and then 
> try to run it in Acero, I get an error.
> The plan:
> {code:r}
> message of type 'substrait.Plan' with 3 fields set
> extension_uris {
>   extension_uri_anchor: 1
>   uri: 
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
> }
> extension_uris {
>   extension_uri_anchor: 2
>   uri: 
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
> }
> extension_uris {
>   extension_uri_anchor: 3
>   uri: 
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml"
> }
> extensions {
>   extension_function {
> extension_uri_reference: 3
> function_anchor: 2
> name: "count"
>   }
> }
> relations {
>   rel {
> aggregate {
>   input {
> project {
>   common {
> emit {
>   output_mapping: 9
>   output_mapping: 10
>   output_mapping: 11
>   output_mapping: 12
>   output_mapping: 13
>   output_mapping: 14
>   output_mapping: 15
>   output_mapping: 16
>   output_mapping: 17
> }
>   }
>   input {
> read {
>   base_schema {
> names: "int"
> names: "dbl"
> names: "dbl2"
> names: "lgl"
> names: "false"
> names: "chr"
> names: "verses"
> names: "padded_strings"
> names: "some_negative"
> struct_ {
>   types {
> i32 {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> fp64 {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> fp64 {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> bool_ {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> bool_ {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> string {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> string {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> string {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
>   types {
> fp64 {
>   nullability: NULLABILITY_NULLABLE
> }
>   }
> }
>   }
>   local_files {
> items {
>   uri_file: "file:///tmp/RtmpsBsoZJ/file1915f604cff4a"
>   parquet {
>   }
> }
>   }
> }
>   }
>   expressions {
> selection {
>   direct_reference {
> struct_field {
> }
>   }
>   root_reference {
>   }
> }
>   }
>   expressions {
> selection {
>   direct_reference {
> struct_field {
>   field: 1
> }
>   }
>   root_reference {
>   }
> }
>   }
>   expressions {
> selection {
>   direct_reference {
> 

[jira] [Updated] (ARROW-18394) [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail

2022-12-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18394:
---
Labels: Nightly pull-request-available  (was: Nightly)

> [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail
> --
>
> Key: ARROW-18394
> URL: https://issues.apache.org/jira/browse/ARROW-18394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the following jobs fail:
> |test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3532562061/jobs/5927065343|
> |test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3532562477/jobs/5927066168|
> with:
> {code:java}
>   _ test_roundtrip_with_bytes_unicode[columns0] _
> 
> columns = [b'foo']
> 
>     @pytest.mark.parametrize('columns', ([b'foo'], ['foo']))
>     def test_roundtrip_with_bytes_unicode(columns):
>         df = pd.DataFrame(columns=columns)
>         table1 = pa.Table.from_pandas(df)
> >       table2 = pa.Table.from_pandas(table1.to_pandas())
> 
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_pandas.py:2867:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/array.pxi:830: in pyarrow.lib._PandasConvertible.to_pandas
>     ???
> pyarrow/table.pxi:3908: in pyarrow.lib.Table._to_pandas
>     ???
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819:
>  in table_to_blockmanager
>     columns = _deserialize_column_index(table, all_columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:935:
>  in _deserialize_column_index
>     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1154:
>  in _reconstruct_columns_from_metadata
>     level = level.astype(dtype)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:1029:
>  in astype
>     return Index(new_values, name=self.name, dtype=new_values.dtype, 
> copy=False)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:518:
>  in __new__
>     klass = cls._dtype_to_subclass(arr.dtype)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ cls = , dtype = dtype('S3')    
> @final
>     @classmethod
>     def _dtype_to_subclass(cls, dtype: DtypeObj):
>         # Delay import for perf. 
> https://github.com/pandas-dev/pandas/pull/31423
>     
>         if isinstance(dtype, ExtensionDtype):
>             if isinstance(dtype, DatetimeTZDtype):
>                 from pandas import DatetimeIndex
>     
>                 return DatetimeIndex
>             elif isinstance(dtype, CategoricalDtype):
>                 from pandas import CategoricalIndex
>     
>                 return CategoricalIndex
>             elif isinstance(dtype, IntervalDtype):
>                 from pandas import IntervalIndex
>     
>                 return IntervalIndex
>             elif isinstance(dtype, PeriodDtype):
>                 from pandas import PeriodIndex
>     
>                 return PeriodIndex
>     
>             return Index
>     
>         if dtype.kind == "M":
>             from pandas import DatetimeIndex
>     
>             return DatetimeIndex
>     
>         elif dtype.kind == "m":
>             from pandas import TimedeltaIndex
>     
>             return TimedeltaIndex
>     
>         elif dtype.kind == "f":
>             from pandas.core.api import Float64Index
>     
>             return Float64Index
>         elif dtype.kind == "u":
>             from pandas.core.api import UInt64Index
>     
>             return UInt64Index
>         elif dtype.kind == "i":
>             from pandas.core.api import Int64Index
>     
>             return Int64Index
>     
>         elif dtype.kind == "O":
>             # NB: assuming away MultiIndex
>             return Index
>     
>         elif issubclass(
>             dtype.type, (str, bool, np.bool_, complex, np.complex64, 
> np.complex128)
>         ):
>             return Index
>     
> >       raise NotImplementedError(dtype)
> E       NotImplementedError: |S3
> 
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:595: NotImplementedError
> {code}





[jira] [Updated] (ARROW-17538) [C++] Importing an ArrowArrayStream can't handle errors from get_schema

2022-12-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17538:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [C++] Importing an ArrowArrayStream can't handle errors from get_schema
> ---
>
> Key: ARROW-17538
> URL: https://issues.apache.org/jira/browse/ARROW-17538
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 9.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As indicated in the code: 
> https://github.com/apache/arrow/blob/cd3c6ead97d584366aafd2f14d99a1cb8ace9ca2/cpp/src/arrow/c/bridge.cc#L1823
>  
> This probably needs a static initializer so we can catch things.





[jira] [Updated] (ARROW-18353) [C++][Flight] Sporadic hang in UCX tests

2022-12-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18353:
---
Labels: pull-request-available  (was: )

> [C++][Flight] Sporadic hang in UCX tests
> 
>
> Key: ARROW-18353
> URL: https://issues.apache.org/jira/browse/ARROW-18353
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The UCX tests sometimes hang here.
> Full gdb backtraces for all threads:
> {code}
> Thread 8 (Thread 0x7f4562fcd700 (LWP 76837)):
> #0  0x7f4577b72ad3 in futex_wait_cancelable (private=, 
> expected=0, futex_word=0x564ebe5b5b3c)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x564ebe5b5ae0, 
> cond=0x564ebe5b5b10) at pthread_cond_wait.c:502
> #2  __pthread_cond_wait (cond=0x564ebe5b5b10, mutex=0x564ebe5b5ae0) at 
> pthread_cond_wait.c:655
> #3  0x7f457b4ce7cb in 
> std::condition_variable::wait namespace)::WriteClientStream::WritesDone():: 
> >(std::unique_lock &, struct {...}) (this=0x564ebe5b5b10, 
> __lock=..., __p=...)
> at 
> /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/condition_variable:111
> #4  0x7f457b4c7b5e in arrow::flight::transport::ucx::(anonymous 
> namespace)::WriteClientStream::WritesDone (this=0x564ebe5b5a90)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:277
> #5  0x7f457b4cc989 in arrow::flight::transport::ucx::(anonymous 
> namespace)::UcxClientStream::DoFinish (this=0x564ebe5b5a90)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_client.cc:692
> #6  0x7f457af80e04 in arrow::flight::internal::ClientDataStream::Finish 
> (this=0x564ebe5b5a90, st=...) at /arrow/cpp/src/arrow/flight/transport.cc:46
> #7  0x7f457af4f6e1 in arrow::flight::ClientMetadataReader::ReadMetadata 
> (this=0x564ebe560630, out=0x7f4562fcc170)
> at /arrow/cpp/src/arrow/flight/client.cc:263
> #8  0x7f457b593af6 in operator() (__closure=0x564ebe4e4848) at 
> /arrow/cpp/src/arrow/flight/test_definitions.cc:1538
> #9  0x7f457b5b66b8 in std::__invoke_impl arrow::flight::ErrorHandlingTest::TestDoPut():: 
> >(std::__invoke_other, struct {...} &&) (__f=...)
> at 
> /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:60
> #10 0x7f457b5b6529 in 
> std::__invoke 
> >(struct {...} &&) (__fn=...)
> at 
> /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/bits/invoke.h:95
> #11 0x7f457b5b63c4 in 
> std::thread::_Invoker
>  > >::_M_invoke<0>(std::_Index_tuple<0>) (
> this=0x564ebe4e4848) at 
> /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:264
> #12 0x7f457b5b6224 in 
> std::thread::_Invoker
>  > >::operator()(void) (
> this=0x564ebe4e4848) at 
> /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:271
> #13 0x7f457b5b5e1e in 
> std::thread::_State_impl
>  > > >::_M_run(void) (this=0x564ebe4e4840) at 
> /opt/conda/envs/arrow/x86_64-conda-linux-gnu/include/c++/10.4.0/thread:215
> #14 0x7f4578242a93 in std::execute_native_thread_routine (__p= out>)
> at 
> /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/new_allocator.h:82
> #15 0x7f4577b6c6db in start_thread (arg=0x7f4562fcd700) at 
> pthread_create.c:463
> #16 0x7f4577ea561f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> Thread 7 (Thread 0x7f45725ca700 (LWP 76828)):
> #0  0x7f4577ea5947 in epoll_wait (epfd=36, 
> events=events@entry=0x7f45725c86c0, maxevents=16, timeout=timeout@entry=0)
> at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1  0x7f45779fe3e3 in ucs_event_set_wait (event_set=0x7f4564026240, 
> num_events=num_events@entry=0x7f45725c8804, timeout_ms=timeout_ms@entry=0, 
> event_set_handler=event_set_handler@entry=0x7f4575d29320 
> , arg=arg@entry=0x7f45725c8800) at 
> sys/event_set.c:198
> #2  0x7f4575d29283 in uct_tcp_iface_progress (tl_iface=0x7f4564026900) at 
> tcp/tcp_iface.c:327
> #3  0x7f4577a7de22 in ucs_callbackq_dispatch (cbq=) at 
> /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211
> #4  uct_worker_progress (worker=) at 
> /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638
> #5  ucp_worker_progress (worker=0x7f4564000c80) at core/ucp_worker.c:2782
> #6  0x7f457b4f186f in 
> arrow::flight::transport::ucx::UcpCallDriver::Impl::MakeProgress 
> (this=0x7f456404d3b0)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:759
> #7  0x7f457b4eee40 in 
> 

[jira] [Updated] (ARROW-18351) [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange

2022-12-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18351:
---
Labels: pull-request-available  (was: )

> [C++][Flight] Crash in UcxErrorHandlingTest.TestDoExchange
> --
>
> Key: ARROW-18351
> URL: https://issues.apache.org/jira/browse/ARROW-18351
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I get a non-deterministic crash in the Flight UCX tests.
> {code}
> [--] 3 tests from UcxErrorHandlingTest
> [ RUN  ] UcxErrorHandlingTest.TestGetFlightInfo
> [   OK ] UcxErrorHandlingTest.TestGetFlightInfo (24 ms)
> [ RUN  ] UcxErrorHandlingTest.TestDoPut
> [   OK ] UcxErrorHandlingTest.TestDoPut (15 ms)
> [ RUN  ] UcxErrorHandlingTest.TestDoExchange
> /arrow/cpp/src/arrow/util/future.cc:125:  Check failed: 
> !IsFutureFinished(state_) Future already marked finished
> {code}
> Here is the GDB backtrace:
> {code}
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x7f18c49cd7f1 in __GI_abort () at abort.c:79
> #2  0x7f18c5854e00 in arrow::util::CerrLog::~CerrLog 
> (this=0x7f18a81607b0, __in_chrg=) at 
> /arrow/cpp/src/arrow/util/logging.cc:72
> #3  0x7f18c5854e1c in arrow::util::CerrLog::~CerrLog 
> (this=0x7f18a81607b0, __in_chrg=) at 
> /arrow/cpp/src/arrow/util/logging.cc:74
> #4  0x7f18c5855181 in arrow::util::ArrowLog::~ArrowLog 
> (this=0x7f18c07fc380, __in_chrg=) at 
> /arrow/cpp/src/arrow/util/logging.cc:250
> #5  0x7f18c5826f86 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed 
> (this=0x7f18a815f030, state=arrow::FutureState::FAILURE)
> at /arrow/cpp/src/arrow/util/future.cc:125
> #6  0x7f18c58265af in arrow::ConcreteFutureImpl::DoMarkFailed 
> (this=0x7f18a815f030) at /arrow/cpp/src/arrow/util/future.cc:40
> #7  0x7f18c5827660 in arrow::FutureImpl::MarkFailed (this=0x7f18a815f030) 
> at /arrow/cpp/src/arrow/util/future.cc:195
> #8  0x7f18c80ff8d8 in 
> arrow::Future 
> >::DoMarkFinished (this=0x7f18a815efb0, res=...)
> at /arrow/cpp/src/arrow/util/future.h:660
> #9  0x7f18c80fb37d in 
> arrow::Future 
> >::MarkFinished (this=0x7f18a815efb0, res=...)
> at /arrow/cpp/src/arrow/util/future.h:403
> #10 0x7f18c80f5ae3 in 
> arrow::flight::transport::ucx::UcpCallDriver::Impl::Push 
> (this=0x7f18a804d2d0, status=...)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:780
> #11 0x7f18c80f5c1f in 
> arrow::flight::transport::ucx::UcpCallDriver::Impl::RecvActiveMessage 
> (this=0x7f18a804d2d0, header=0x7f18c8081865, header_length=12, 
> data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at 
> /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:791
> #12 0x7f18c80f7d29 in 
> arrow::flight::transport::ucx::UcpCallDriver::RecvActiveMessage 
> (this=0x7f18b80017e0, header=0x7f18c8081865, header_length=12, 
> data=0x7f18c8081864, data_length=1, param=0x7f18c07fc680) at 
> /arrow/cpp/src/arrow/flight/transport/ucx/ucx_internal.cc:1082
> #13 0x7f18c80e3ea4 in arrow::flight::transport::ucx::(anonymous 
> namespace)::UcxServerImpl::HandleIncomingActiveMessage (self=0x7f18a80259a0, 
> header=0x7f18c8081865, header_length=12, data=0x7f18c8081864, 
> data_length=1, param=0x7f18c07fc680)
> at /arrow/cpp/src/arrow/flight/transport/ucx/ucx_server.cc:586
> #14 0x7f18c4661a09 in ucp_am_invoke_cb (recv_flags=, 
> reply_ep=, data_length=1, data=, 
> user_hdr_length=, user_hdr=0x7f18c8081865, am_id=4132, 
> worker=) at core/ucp_am.c:1220
> #15 ucp_am_handler_common (name=, recv_flags= out>, am_flags=0, reply_ep=, total_length=, 
> am_hdr=0x7f18c808185c, worker=) at core/ucp_am.c:1289
> #16 ucp_am_handler_reply (am_arg=, am_data=, 
> am_length=, am_flags=) at core/ucp_am.c:1327
> #17 0x7f18c28e3f1c in uct_iface_invoke_am (flags=0, length=29, 
> data=0x7f18c808185c, id=, iface=0x7f18a8027e20)
> at /usr/local/src/conda/ucx-1.13.1/src/uct/base/uct_iface.h:861
> #18 uct_mm_iface_invoke_am (flags=0, length=29, data=0x7f18c808185c, 
> am_id=, iface=0x7f18a8027e20) at sm/mm/base/mm_iface.h:256
> #19 uct_mm_iface_process_recv (iface=0x7f18a8027e20) at 
> sm/mm/base/mm_iface.c:256
> #20 uct_mm_iface_poll_fifo (iface=0x7f18a8027e20) at sm/mm/base/mm_iface.c:304
> #21 uct_mm_iface_progress (tl_iface=0x7f18a8027e20) at 
> sm/mm/base/mm_iface.c:357
> #22 0x7f18c4686e22 in ucs_callbackq_dispatch (cbq=) at 
> /usr/local/src/conda/ucx-1.13.1/src/ucs/datastruct/callbackq.h:211
> #23 uct_worker_progress (worker=) at 
> /usr/local/src/conda/ucx-1.13.1/src/uct/api/uct.h:2638
> #24 

[jira] [Updated] (ARROW-18231) [C++] Cannot override optimization level using CXXFLAGS

2022-12-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18231:
---
Labels: pull-request-available  (was: )

> [C++] Cannot override optimization level using CXXFLAGS
> ---
>
> Key: ARROW-18231
> URL: https://issues.apache.org/jira/browse/ARROW-18231
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In release mode, Arrow C++ unconditionally adds {{-O2}} _at the end_ of the 
> compiler flags.
> So, if you do something like:
> {code:bash}
> export CXXFLAGS=-O0
> cmake ...
> {code}
> the final compilation flags will look like {{-O0 -O2}}, meaning the 
> user-provided optimization level is ignored.
> One can instead use the {{ARROW_CXXFLAGS}} CMake variable, but it only 
> overrides the flags for Arrow itself, not the bundled dependencies.





[jira] [Updated] (ARROW-18438) [Go] firstTimeBitmapWriter.Finish() panics with 8n structs

2022-12-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18438:
---
Labels: pull-request-available  (was: )

> [Go] firstTimeBitmapWriter.Finish() panics with 8n structs
> --
>
> Key: ARROW-18438
> URL: https://issues.apache.org/jira/browse/ARROW-18438
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Parquet
>Affects Versions: 10.0.1
>Reporter: Min-Young Wu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Even after [ARROW-17169|https://issues.apache.org/jira/browse/ARROW-17169] I 
> still get a panic at the same location.
> Below is a test case that panics:
> {code:go}
> func (ps *ParquetIOTestSuite) TestStructWithListOf8Structs() {
>   bldr := array.NewStructBuilder(memory.DefaultAllocator, arrow.StructOf(
>   arrow.Field{
>   Name: "l",
>   Type: arrow.ListOf(arrow.StructOf(
>   arrow.Field{Name: "a", Type: 
> arrow.BinaryTypes.String},
>   )),
>   },
>   ))
>   defer bldr.Release()
>   lBldr := bldr.FieldBuilder(0).(*array.ListBuilder)
>   stBldr := lBldr.ValueBuilder().(*array.StructBuilder)
>   aBldr := stBldr.FieldBuilder(0).(*array.StringBuilder)
>   bldr.AppendNull()
>   bldr.Append(true)
>   lBldr.Append(true)
>   for i := 0; i < 8; i++ {
>   stBldr.Append(true)
>   aBldr.Append(strconv.Itoa(i))
>   }
>   arr := bldr.NewArray()
>   defer arr.Release()
>   field := arrow.Field{Name: "x", Type: arr.DataType(), Nullable: true}
>   expected := array.NewTable(arrow.NewSchema([]arrow.Field{field}, nil),
>   []arrow.Column{*arrow.NewColumn(field, 
> arrow.NewChunked(field.Type, []arrow.Array{arr}))}, -1)
>   defer expected.Release()
>   ps.roundTripTable(expected, false)
> }
> {code}
> I've tried to trim down the input data and this is as minimal as I could get 
> it. And yes:
> * wrapping struct with initial null is required
> * the inner list needs to contain 8 structs (or any multiple of 8)
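The multiple-of-8 condition suggests a byte-boundary bug in the bitmap writer. Below is an illustrative Python sketch (assumed behaviour, not the Go implementation) of the edge case: a first-time bitmap writer must only flush a trailing partial byte when the bit count is not a multiple of 8, otherwise the flush overruns the buffer.

```python
# Pack validity bits LSB-first into exactly ceil(n/8) bytes.
def write_bitmap(bits):
    buf = bytearray((len(bits) + 7) // 8)
    current, bit_pos, byte_pos = 0, 0, 0
    for b in bits:
        if b:
            current |= 1 << bit_pos
        bit_pos += 1
        if bit_pos == 8:
            buf[byte_pos] = current
            byte_pos += 1
            current, bit_pos = 0, 0
    # Finish(): flushing unconditionally here would index past the end of
    # buf whenever len(bits) % 8 == 0 -- the boundary this issue hits.
    if bit_pos > 0:
        buf[byte_pos] = current
    return bytes(buf)
```

For example, `write_bitmap([1] * 8)` already wrote its single byte inside the loop, so the guarded flush correctly does nothing.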





[jira] [Updated] (ARROW-18436) [Python] `FileSystem.from_uri` doesn't decode %-encoded characters in path

2022-12-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18436:
---
Labels: pull-request-available  (was: )

> [Python] `FileSystem.from_uri` doesn't decode %-encoded characters in path
> --
>
> Key: ARROW-18436
> URL: https://issues.apache.org/jira/browse/ARROW-18436
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.1
> Environment: - OS: macOS
> - `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
> - `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
>Reporter: James Bourbeau
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When attempting to create a new filesystem object from a public dataset in 
> S3, where there is a space in the object path, an error is raised.
>  
> Here's a minimal reproducer:
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet") {code}
> which fails with the following traceback:
>  
> {code:java}
> Traceback (most recent call last):
>   File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
>     result = FileSystem.from_uri("s3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet")
>   File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip 
> data/fhvhv_tripdata_2022-06.parquet'{code}
>  
> Note that things work if I use a different dataset that doesn't have a space 
> in the URI, or if I replace the portion of the URI that has a space with a 
> `*` wildcard
>  
> {code:java}
> from pyarrow.fs import FileSystem
> result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") 
> # works
>  result = 
> FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works
> {code}
>  
> The wildcard isn't necessarily equivalent to the original failing URI, but I 
> think it highlights that the space is the problematic part.
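A small standard-library sketch of the workaround implied by the issue title: percent-encode the path before building the URI, with the expectation that `from_uri` decodes it back. `quote`/`unquote` are from Python's `urllib.parse`; the S3 key is the one from the report.

```python
from urllib.parse import quote, unquote

key = "trip data/fhvhv_tripdata_2022-06.parquet"
encoded = quote(key)  # "/" is left unescaped by default
uri = "s3://nyc-tlc/" + encoded

# The space becomes %20, producing a URI the parser accepts; the filesystem
# side should then decode %20 back to a space in the object path.
assert unquote(encoded) == key
```

This only encodes the key portion; the scheme and bucket are left untouched.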





[jira] [Updated] (ARROW-14832) [R] Implement bindings for stringr::str_remove and stringr::str_remove_all

2022-12-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14832:
---
Labels: pull-request-available  (was: )

> [R] Implement bindings for stringr::str_remove and stringr::str_remove_all
> --
>
> Key: ARROW-14832
> URL: https://issues.apache.org/jira/browse/ARROW-14832
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Nicola Crane
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://stringr.tidyverse.org/reference/str_remove.html explains that it is 
> an alias for str_replace with the replacement "".
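To keep the added examples in one language, here is a Python sketch of the stringr relationship described above (regex-based, mirroring the semantics rather than the R implementation): `str_remove` is `str_replace` with an empty replacement on the first match only, while `str_remove_all` replaces every match.

```python
import re

def str_replace(string, pattern, replacement):
    # First match only, like stringr::str_replace.
    return re.sub(pattern, replacement, string, count=1)

def str_remove(string, pattern):
    # The alias: replace the first match with "".
    return str_replace(string, pattern, "")

def str_remove_all(string, pattern):
    # Replace every match with "".
    return re.sub(pattern, "", string)
```

For example, `str_remove("banana", "na")` drops only the first "na", while `str_remove_all("banana", "na")` drops both.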





[jira] [Updated] (ARROW-18434) [C++][Parquet] Parquet page index read support

2022-12-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18434:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Parquet page index read support
> --
>
> Key: ARROW-18434
> URL: https://issues.apache.org/jira/browse/ARROW-18434
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement read support for parquet page index and expose it from the reader 
> API.





[jira] [Updated] (ARROW-18437) [C++] Parquet DELTA_BINARY_PACKED Page didn't clear the context

2022-12-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18437:
---
Labels: pull-request-available  (was: )

> [C++] Parquet DELTA_BINARY_PACKED Page didn't clear the context
> ---
>
> Key: ARROW-18437
> URL: https://issues.apache.org/jira/browse/ARROW-18437
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet
>Affects Versions: 11.0.0
>Reporter: Xuwei Fu
>Assignee: Xuwei Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When calling {{flushValues}}, the encoder fails to:
>  * clear {{total_value_count_}}
>  * re-advance the buffer by {{kMaxPageHeaderWriterSize}}
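An abstract Python sketch of this bug class (illustrative only, not the C++ Parquet encoder; `HEADER_RESERVE` stands in for `kMaxPageHeaderWriterSize`): a page encoder whose flush must reset its per-page state, or the next page inherits a stale value count and loses its reserved header space.

```python
HEADER_RESERVE = 4  # stand-in for kMaxPageHeaderWriterSize

class PageEncoder:
    def __init__(self):
        self.total_value_count = 0
        self.buffer = bytearray(HEADER_RESERVE)  # reserved header region

    def put(self, value):
        self.buffer.append(value & 0xFF)
        self.total_value_count += 1

    def flush_values(self):
        page = bytes(self.buffer)
        # The fix described above: clear the count and re-reserve the
        # header region so the next page starts from a clean state.
        self.total_value_count = 0
        self.buffer = bytearray(HEADER_RESERVE)
        return page

enc = PageEncoder()
enc.put(1)
enc.put(2)
first = enc.flush_values()
```

After `flush_values()`, the encoder is back in its initial state and can encode the next page independently of the first.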





[jira] [Updated] (ARROW-18425) Add support for Substrait round expression

2022-12-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18425:
---
Labels: pull-request-available  (was: )

> Add support for Substrait round expression
> --
>
> Key: ARROW-18425
> URL: https://issues.apache.org/jira/browse/ARROW-18425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Bryce Mecum
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Work has been started on adding round to Substrait in 
> [https://github.com/substrait-io/substrait/pull/322] and it looks like a 
> mapping needs to be registered on the Acero side for Acero to consume plans 
> with it.





[jira] [Updated] (ARROW-18099) [Python] Cannot create pandas categorical from table only with nulls

2022-12-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18099:
---
Labels: pull-request-available python-conversion  (was: python-conversion)

> [Python] Cannot create pandas categorical from table only with nulls
> 
>
> Key: ARROW-18099
> URL: https://issues.apache.org/jira/browse/ARROW-18099
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: OSX 12.6
> M1 silicon
>Reporter: Damian Barabonkov
>Priority: Minor
>  Labels: pull-request-available, python-conversion
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A pyarrow Table column containing only null values cannot be converted to a 
> pandas DataFrame with that column as a category. However, pandas does support 
> "empty" categoricals, so a simple workaround is to load the column as object 
> dtype first and, once in pandas, convert it to a (then empty) categorical. 
> That workaround does not fix the pyarrow bug at its root, though.
>  
> Sample reproducible example
> {code:python}
> import pyarrow as pa
> pylist = [{'x': None, '__index_level_0__': 2}, {'x': None, 
> '__index_level_0__': 3}]
> tbl = pa.Table.from_pylist(pylist)
>  
> # Errors
> df_broken = tbl.to_pandas(categories=["x"])
>  
> # Works
> df_works = tbl.to_pandas()
> df_works = df_works.astype({"x": "category"}) {code}





[jira] [Updated] (ARROW-18435) [C++][Java] Update ORC to 1.8.1

2022-12-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18435:
---
Labels: pull-request-available  (was: )

> [C++][Java] Update ORC to 1.8.1
> ---
>
> Key: ARROW-18435
> URL: https://issues.apache.org/jira/browse/ARROW-18435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-17637) [R] as.Date fails going from timestamp[us] to timestamp[s]

2022-12-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17637:
---
Labels: pull-request-available  (was: )

> [R] as.Date fails going from timestamp[us] to timestamp[s]
> --
>
> Key: ARROW-17637
> URL: https://issues.apache.org/jira/browse/ARROW-17637
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Using as.Date to convert from timestamp to date fails in Arrow even though 
> this is fine in R.
> {code:r}
> library(arrow)
> library(dplyr)
> library(lubridate)
> tf <- tempfile()
> dir.create(tf)
> tbl <- tibble::tibble(x = as_datetime('2022-05-05T00:00:01.676632'))
> write_dataset(tbl, tf)
> open_dataset(tf) %>%
>   mutate(date = as.Date(x)) %>%
>   collect()
> #> Error in `collect()`:
> #> ! Invalid: Casting from timestamp[us, tz=UTC] to timestamp[s, tz=UTC] 
> would lose data: 1651708801676632
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec.cc:799  
> kernel_->exec(kernel_ctx_, input, out)
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec.cc:767  
> ExecuteSingleSpan(input, )
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec/expression.cc:597  
> executor->Execute( ExecBatch(std::move(arguments), all_scalar ? 1 : 
> input.length), )
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec/expression.cc:579  
> ExecuteScalarExpression(call->arguments[i], input, exec_context)
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec/project_node.cc:91  
> ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
> #> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:573  
> iterator_.Next()
> #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext()
> #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
> tbl %>%
>   mutate(date = as.Date(x))
> #> # A tibble: 1 × 2
> #>   x   date  
> #> 
> #> 1 2022-05-05 00:00:01 2022-05-05
> {code}





[jira] [Updated] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`

2022-12-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18427:
---
Labels: pull-request-available  (was: )

> [C++] Support negative tolerance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a negative 
> tolerance, which would allow future-joining too.
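The intended window semantics can be sketched in plain Python. This is a simplification of `AsofJoinNode`'s matching rule for a single left row, and one plausible reading of past- versus future-joining; the exact window convention in Acero may differ:

```python
def asof_match(left_time, right_times, tolerance):
    """Pick the matching right timestamp for one left row.

    tolerance >= 0 (past-joining): latest right time in
        [left_time - tolerance, left_time].
    tolerance < 0 (future-joining): earliest right time in
        [left_time, left_time - tolerance].
    Returns None when no right time falls in the window.
    """
    if tolerance >= 0:
        window = [r for r in right_times if left_time - tolerance <= r <= left_time]
        return max(window, default=None)
    window = [r for r in right_times if left_time <= r <= left_time - tolerance]
    return min(window, default=None)
```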





[jira] [Updated] (ARROW-17361) [R] dplyr::summarize fails with division when divisor is a variable

2022-12-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17361:
---
Labels: aggregation dplyr pull-request-available  (was: aggregation dplyr)

> [R] dplyr::summarize fails with division when divisor is a variable
> ---
>
> Key: ARROW-17361
> URL: https://issues.apache.org/jira/browse/ARROW-17361
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Oliver Reiter
>Priority: Minor
>  Labels: aggregation, dplyr, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hello,
> I found this odd behaviour when trying to compute an aggregate with 
> dplyr::summarize: when I use a pre-defined variable as the divisor in the 
> aggregation, execution fails with 'unsupported expression'. When I use the 
> value of the variable literally in the aggregation, it works.
>  
> See below:
>  
> {code:r}
> library(dplyr)
> library(arrow)
> small_dataset <- tibble::tibble(
>   ## x = rep(c("a", "b"), each = 5),
>   y = rep(1:5, 2)
> )
> ## convert "small_dataset" into a ...dataset
> tmpdir <- tempfile()
> dir.create(tmpdir)
> write_dataset(small_dataset, tmpdir)
> ## works
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / 10) %>%
>   collect()
> ## fails
> scale_factor <- 10
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / scale_factor) %>%
>   collect()
> #> Fehler: Error in summarize_eval(names(exprs)[i],
> #> exprs[[i]], ctx, length(.data$group_by_vars) > :
> #   Expression sum(y)/scale_factor is not an aggregate
> #   expression or is not supported in Arrow
> # Call collect() first to pull data into R.
>    {code}
> I was not sure how to name this issue/bug (if it is one), so if there is a 
> clearer, more descriptive title you're welcome to adjust.
>  
> Thanks for your work!
>  
> Oliver
>  
> {code:r}
> > arrow_info()
> Arrow package version: 8.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>                   
> Allocator jemalloc
> Current   64 bytes
> Max       41.25 Kb
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                            
> C++ Library Version   8.0.0
> C++ Compiler            GNU
> C++ Compiler Version 12.1.0 {code}





[jira] [Updated] (ARROW-17054) [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0

2022-12-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17054:
---
Labels: pull-request-available  (was: )

> [R] Creating an Array from an object bigger than 2^31 results in an Array of 
> length 0
> -
>
> Key: ARROW-17054
> URL: https://issues.apache.org/jira/browse/ARROW-17054
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Apologies for the lack of a proper reprex, but it crashes my session when I 
> try to make one.
> I'm working on ARROW-16977, which is about integer-overflow issues in 
> reporting object sizes, but this issue affects object creation.
> {code:r}
> library(arrow, warn.conflicts = TRUE)
> # works - creates a huge array, hurrah
> big_logical <- vector(mode = "logical", length = .Machine$integer.max)
> big_logical_array <- Array$create(big_logical)
> length(big_logical)
> ## [1] 2147483647
> length(big_logical_array)
> ## [1] 2147483647
> # creates an array of length 0, boo!
> too_big <- vector(mode = "logical", length = .Machine$integer.max + 1) 
> too_big_array <- Array$create(too_big)
> length(too_big)
> ## [1] 2147483648
> length(too_big_array)
> ## [1] 0
>  {code}
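A plausible mechanism (an assumption, not confirmed by the issue) is 32-bit truncation of the length somewhere in the conversion path. Python's ctypes shows what happens when a length one past R's `.Machine$integer.max` is squeezed into a signed 32-bit integer:

```python
import ctypes

int32_max = 2**31 - 1   # R's .Machine$integer.max
length = int32_max + 1  # 2^31, one past the maximum

# Truncating to a signed 32-bit value wraps around to a negative number;
# downstream code that clamps or rejects negative lengths would then
# produce an empty (length-0) array.
wrapped = ctypes.c_int32(length).value
print(wrapped)  # -2147483648
```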





[jira] [Updated] (ARROW-17838) [Python] Unify CMakeLists.txt at python/CMakeLists.txt and python/src/CMakeLists.txt

2022-12-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17838:
---
Labels: pull-request-available  (was: )

> [Python] Unify CMakeLists.txt at python/CMakeLists.txt and 
> python/src/CMakeLists.txt
> 
>
> Key: ARROW-17838
> URL: https://issues.apache.org/jira/browse/ARROW-17838
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-16266) [R] Add StructArray$create()

2022-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16266:
---
Labels: pull-request-available  (was: )

> [R] Add StructArray$create()
> 
>
> Key: ARROW-16266
> URL: https://issues.apache.org/jira/browse/ARROW-16266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ARROW-13371 we implemented the {{make_struct}} compute function bound to 
> {{data.frame()}} / {{tibble()}} in dplyr evaluation; however, we didn't 
> actually implement {{StructArray$create()}}. In ARROW-15168, it turns out 
> that we need to do this to support {{StructArray}} creation from data.frames 
> whose columns aren't all convertible using the internal C++ conversion. The 
> hack used in that PR is below (but we should clearly implement the C++ 
> function instead of using the hack):
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> struct_array <- function(...) {
>   batch <- record_batch(...)
>   array_ptr <- arrow:::allocate_arrow_array()
>   schema_ptr <- arrow:::allocate_arrow_schema()
>   batch$export_to_c(array_ptr, schema_ptr)
>   Array$import_from_c(array_ptr, schema_ptr)
> }
> struct_array(a = 1, b = "two")
> #> StructArray
> #> >
> #> -- is_valid: all not null
> #> -- child 0 type: double
> #>   [
> #> 1
> #>   ]
> #> -- child 1 type: string
> #>   [
> #> "two"
> #>   ]
> {code}





[jira] [Updated] (ARROW-18401) [R] Failing test on test-r-rhub-ubuntu-gcc-release-latest

2022-12-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18401:
---
Labels: pull-request-available  (was: )

> [R] Failing test on test-r-rhub-ubuntu-gcc-release-latest
> -
>
> Key: ARROW-18401
> URL: https://issues.apache.org/jira/browse/ARROW-18401
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I think this is an R problem where a string is not getting converted to a 
> timestamp (the kernel mentioned in the error as missing probably doesn't and 
> shouldn't exist).
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=40090=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=22256
> {code:java}
> ══ Failed tests 
> 
> ── Error ('test-dplyr-query.R:694'): Scalars in expressions match the type of 
> the field, if possible ──
> Error: NotImplemented: Function 'greater' has no kernel matching input types 
> (timestamp[us, tz=UTC], string)
> Backtrace:
>  ▆
>   1. ├─testthat::expect_output(...) at test-dplyr-query.R:694:2
>   2. │ └─testthat:::quasi_capture(...)
>   3. │   ├─testthat (local) .capture(...)
>   4. │   │ └─testthat::capture_output_lines(code, print, width = width)
>   5. │   │   └─testthat:::eval_with_output(code, print = print, width = width)
>   6. │   │ ├─withr::with_output_sink(path, withVisible(code))
>   7. │   │ │ └─base::force(code)
>   8. │   │ └─base::withVisible(code)
>   9. │   └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
>  10. ├─tab %>% filter(times > "2018-10-07 19:04:05") %>% ...
>  11. └─arrow::show_exec_plan(.)
>  12.   ├─arrow::as_record_batch_reader(adq)
>  13.   └─arrow:::as_record_batch_reader.arrow_dplyr_query(adq)
>  14. └─plan$Build(x)
>  15.   └─node$Filter(.data$filtered_rows)
>  16. ├─self$preserve_extras(ExecNode_Filter(self, expr))
>  17. └─arrow:::ExecNode_Filter(self, expr)
> {code}





[jira] [Updated] (ARROW-18429) [R] Bump dev version following 10.0.1 patch release

2022-12-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18429:
---
Labels: pull-request-available  (was: )

> [R] Bump dev version following 10.0.1 patch release
> ---
>
> Key: ARROW-18429
> URL: https://issues.apache.org/jira/browse/ARROW-18429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CI job fails with:
> {code:java}
>Insufficient package version (submitted: 10.0.0.9000, existing: 10.0.1)
>   Version contains large components (10.0.0.9000)
> {code}
> https://github.com/apache/arrow/actions/runs/3639669477/jobs/6145488845#step:10:567





[jira] [Updated] (ARROW-18320) [C++] Flight client may crash due to improper Result/Status conversion

2022-12-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18320:
---
Labels: pull-request-available  (was: )

> [C++] Flight client may crash due to improper Result/Status conversion
> --
>
> Key: ARROW-18320
> URL: https://issues.apache.org/jira/browse/ARROW-18320
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 6.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reported on user@ 
> https://lists.apache.org/thread/84z329t1djhnbr5bq936v4hr8cyngj2l 
> {noformat}
> I have an issue on my project: we have a query execution engine that
> returns result data as a Flight stream and a C++ client that receives the
> stream. When a query has no results but the result schema implies
> dictionary-encoded fields in the results, the client app crashes.
> The cause is in cpp/src/arrow/flight/client.cc:461:
> ::arrow::Result<std::unique_ptr<ipc::Message>> ReadNextMessage() override {
>   if (stream_finished_) {
>     return nullptr;
>   }
>   internal::FlightData* data;
>   {
>     auto guard = read_mutex_ ? std::unique_lock<std::mutex>(*read_mutex_)
>                              : std::unique_lock<std::mutex>();
>     peekable_reader_->Next(&data);
>   }
>   if (!data) {
>     stream_finished_ = true;
>     return stream_->Finish(Status::OK()); // Here the issue
>   }
>   // Validate IPC message
>   auto result = data->OpenMessage();
>   if (!result.ok()) {
>     return stream_->Finish(std::move(result).status());
>   }
>   *app_metadata_ = std::move(data->app_metadata);
>   return result;
> }
> The method returns a Result object while stream_->Finish(...) returns a
> Status, so there is an implicit conversion from Status to Result that causes
> the Result(Status) constructor to be called. That constructor expects only
> error statuses, which in turn causes the app to abort:
> /// Constructs a Result object with the given non-OK Status object. All
> /// calls to ValueOrDie() on this object will abort. The given `status` must
> /// not be an OK status, otherwise this constructor will abort.
> ///
> /// This constructor is not declared explicit so that a function with a
> /// return type of `Result` can return a Status object, and the status will
> /// be implicitly converted to the appropriate return type as a matter of
> /// convenience.
> ///
> /// \param status The non-OK Status object to initialize to.
> Result(const Status& status) noexcept // NOLINT(runtime/explicit)
>     : status_(status) {
>   if (ARROW_PREDICT_FALSE(status.ok())) {
>     internal::DieWithMessage(
>         std::string("Constructed with a non-error status: ") +
>         status.ToString());
>   }
> }
> Is there a way to work around or fix it? We use Arrow 6.0.0, but it seems
> that the issue exists in all later versions.
> {noformat}
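The crash mechanism can be mimicked in a few lines of Python. The classes below are a loose, hypothetical analogue of Arrow's C++ `Result`/`Status`, not its API; they only reproduce the "non-error status aborts the constructor" behaviour described above:

```python
class Status:
    """Minimal stand-in for arrow::Status."""
    def __init__(self, ok=True, message=""):
        self.ok, self.message = ok, message

    @staticmethod
    def OK():
        return Status(True)


class Result:
    """Holds either a value or a *non-OK* status, like arrow::Result."""
    def __init__(self, value=None, status=None):
        if status is not None and status.ok:
            # Mirrors internal::DieWithMessage in the C++ constructor.
            raise RuntimeError("Constructed with a non-error status")
        self.value, self.status = value, status


def read_next_message(stream_finished):
    # Returning a plain OK status from a Result-returning function is the
    # bug: the implicit Status -> Result conversion aborts.
    if stream_finished:
        return Result(status=Status.OK())  # crashes, like the C++ code
    return Result(value="message")
```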





[jira] [Updated] (ARROW-18424) [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`

2022-12-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18424:
---
Labels: pull-request-available  (was: )

> [C++] Fix Doxygen error on `arrow::engine::ConversionStrictness`
> 
>
> Key: ARROW-18424
> URL: https://issues.apache.org/jira/browse/ARROW-18424
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Doxygen is hitting the following error: 
> `/arrow/cpp/src/arrow/engine/substrait/options.h:37: error: documented symbol 
> 'enum ARROW_ENGINE_EXPORT arrow::engine::arrow::engine::ConversionStrictness' 
> was not declared or defined. (warning treated as error, aborting now)`. See 
> [this CI job 
> output|https://github.com/apache/arrow/actions/runs/3557712768/jobs/5975904381],
>  for example.





[jira] [Updated] (ARROW-18281) [C++][Python] Support start == stop in list_slice kernel

2022-12-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18281:
---
Labels: C++ Python pull-request-available  (was: C++ Python)

> [C++][Python] Support start == stop in list_slice kernel
> 
>
> Key: ARROW-18281
> URL: https://issues.apache.org/jira/browse/ARROW-18281
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: C++, Python, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [GitHub PR 14395|https://github.com/apache/arrow/pull/14395] adds the 
> {{list_slice}} kernel, but does not implement the case where {{start == 
> stop}}, which should return empty lists.
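In Python terms, the expected behaviour mirrors ordinary slicing; this is a sketch of the semantics, not the kernel itself:

```python
def list_slice(lists, start, stop):
    # Slice each sub-list; nulls (None) propagate. When start == stop every
    # non-null entry becomes an empty list, matching Python's own slicing
    # semantics (e.g. [1, 2, 3][1:1] == []).
    return [None if sub is None else sub[start:stop] for sub in lists]
```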





[jira] [Updated] (ARROW-18423) [Python] Expose reading a schema from an IPC message

2022-12-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18423:
---
Labels: pull-request-available  (was: )

> [Python] Expose reading a schema from an IPC message
> 
>
> Key: ARROW-18423
> URL: https://issues.apache.org/jira/browse/ARROW-18423
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Andre Kohn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Pyarrow currently does not implement reading the Arrow schema from an IPC 
> message.
> [https://github.com/apache/arrow/blob/80b389efe902af376a85a8b3740e0dbdc5f80900/python/pyarrow/ipc.pxi#L1094]
>  
> We'd like to consume Arrow IPC stream data like the following:
>  
> {code:python}
> schema_msg = pyarrow.ipc.read_message(result_iter.next().data)
> schema = pyarrow.ipc.read_schema(schema_msg)
> for batch_data in result_iter:
>     batch_msg = pyarrow.ipc.read_message(batch_data.data)
>     batch = pyarrow.ipc.read_record_batch(batch_msg, schema)
> {code}
>  
> The associated (tiny) PR on GitHub implements this reading by binding the 
> existing C++ function.





[jira] [Updated] (ARROW-18422) [C++] Provide enum reflection utility

2022-12-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18422:
---
Labels: pull-request-available  (was: )

> [C++] Provide enum reflection utility
> -
>
> Key: ARROW-18422
> URL: https://issues.apache.org/jira/browse/ARROW-18422
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that we have c++17, we could try again with ARROW-13296





[jira] [Updated] (ARROW-17374) [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17374:
---
Labels: pull-request-available  (was: )

> [R] R Arrow install fails with SNAPPY_LIB-NOTFOUND
> --
>
> Key: ARROW-17374
> URL: https://issues.apache.org/jira/browse/ARROW-17374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0, 8.0.1, 9.0.0
> Environment: Amazon Linux 2 (RHEL) - 5.10.102-99.473.amzn2.x86_64
>Reporter: Shane Brennan
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: build-images.out, environment.yml
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I've been trying to install Arrow on an R notebook within AWS SageMaker. 
> SageMaker provides Jupyter-like notebooks, with each instance running Amazon 
> Linux 2 as its OS, itself based on RHEL. 
> Trying to install a few ways, e.g., using the standard binaries, using the 
> nightly builds, setting ARROW_WITH_SNAPPY to ON and LIBARROW_MINIMAL all 
> still result in the following error. 
> {noformat}
> x86_64-conda-linux-gnu-c++ -std=gnu++11 -shared 
> -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -Wl,-O2 -Wl,--sort-common 
> -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags 
> -Wl,--gc-sections -Wl,--allow-shlib-undefined 
> -Wl,-rpath,/home/ec2-user/anaconda3/envs/R/lib 
> -Wl,-rpath-link,/home/ec2-user/anaconda3/envs/R/lib 
> -L/home/ec2-user/anaconda3/envs/R/lib -o arrow.so RTasks.o altrep.o array.o 
> array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o 
> compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o 
> expression.o extension-impl.o feather.o field.o filesystem.o imports.o io.o 
> json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o 
> recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o 
> schema.o symbols.o table.o threadpool.o type_infer.o 
> -L/tmp/Rtmpuh87oc/R.INSTALL67114493a3de/arrow/libarrow/arrow-9.0.0.20220809/lib
>  -larrow_dataset -lparquet -larrow -larrow_bundled_dependencies -lz 
> SNAPPY_LIB-NOTFOUND /home/ec2-user/anaconda3/envs/R/lib/libbz2.so -pthread 
> -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto 
> -lcurl -lssl -lcrypto -lcurl -L/home/ec2-user/anaconda3/envs/R/lib/R/lib -lR
> x86_64-conda-linux-gnu-c++: error: SNAPPY_LIB-NOTFOUND: No such file or 
> directory
> make: *** [/home/ec2-user/anaconda3/envs/R/lib/R/share/make/shlib.mk:10: 
> arrow.so] Error 1{noformat}
> Snappy is installed on the system, and both the shared object (.so) and CMake 
> files are present. I've tried setting the environment variables Snappy_DIR 
> and Snappy_LIB to point at them, but to no avail.





[jira] [Updated] (ARROW-18419) [C++] Update vendored fast_float

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18419:
---
Labels: pull-request-available  (was: )

> [C++] Update vendored fast_float
> 
>
> Key: ARROW-18419
> URL: https://issues.apache.org/jira/browse/ARROW-18419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For https://github.com/fastfloat/fast_float/pull/147 .





[jira] [Updated] (ARROW-18333) [Go][Docs] Add and Update compute function docs

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18333:
---
Labels: pull-request-available  (was: )

> [Go][Docs] Add and Update compute function docs
> ---
>
> Key: ARROW-18333
> URL: https://issues.apache.org/jira/browse/ARROW-18333
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18421) [C++][ORC] Add accessor for number of rows by stripe in reader

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18421:
---
Labels: pull-request-available  (was: )

> [C++][ORC] Add accessor for number of rows by stripe in reader
> --
>
> Key: ARROW-18421
> URL: https://issues.apache.org/jira/browse/ARROW-18421
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Louis Calot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I need the number of rows in each stripe to be able to read specific ranges 
> of records from an ORC file without reading all of it. The per-stripe row 
> count was already stored in the implementation but not exposed in the API.
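With per-stripe row counts exposed, a caller can map record ranges to stripes. A sketch of the bookkeeping (hypothetical helpers, not the proposed API):

```python
def stripe_row_ranges(rows_per_stripe):
    # Turn per-stripe row counts into absolute (start, stop) row ranges,
    # so a reader can pick only the stripes overlapping a target range.
    ranges, start = [], 0
    for count in rows_per_stripe:
        ranges.append((start, start + count))
        start += count
    return ranges


def stripes_for_range(rows_per_stripe, first_row, last_row):
    # Indices of stripes that overlap the half-open range [first_row, last_row).
    return [i for i, (lo, hi) in enumerate(stripe_row_ranges(rows_per_stripe))
            if lo < last_row and hi > first_row]
```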





[jira] [Updated] (ARROW-18420) [C++][Parquet] Introduce ColumnIndex and OffsetIndex

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18420:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Introduce ColumnIndex and OffsetIndex
> 
>
> Key: ARROW-18420
> URL: https://issues.apache.org/jira/browse/ARROW-18420
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Parquet
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Define interface of ColumnIndex and OffsetIndex and provide implementation to 
> read from serialized form.
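The shape of the data being exposed can be sketched with dataclasses that mirror the Parquet Thrift page-index structures. Field names follow parquet.thrift; the classes themselves are illustrative, not the proposed C++ interface:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageLocation:
    offset: int                # file offset of the data page
    compressed_page_size: int  # size of the page in bytes
    first_row_index: int       # index of the page's first row within the row group


@dataclass
class OffsetIndex:
    page_locations: List[PageLocation]


@dataclass
class ColumnIndex:
    null_pages: List[bool]   # True where a page is all-null
    min_values: List[bytes]  # per-page lower bounds (plain-encoded)
    max_values: List[bytes]  # per-page upper bounds (plain-encoded)
    boundary_order: int      # BoundaryOrder enum value
    null_counts: Optional[List[int]] = None
```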





[jira] [Updated] (ARROW-18391) [R] Fix the version selector dropdown

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18391:
---
Labels: pull-request-available  (was: )

> [R] Fix the version selector dropdown
> -
>
> Key: ARROW-18391
> URL: https://issues.apache.org/jira/browse/ARROW-18391
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-17887 updates the docs to use Bootstrap 5, which will break the docs 
> version dropdown selector: the selector relies on replacing a page element, 
> and the page elements differ in this version of Bootstrap.





[jira] [Updated] (ARROW-18417) [C++] Support emit info in Substrait extension-multi and AsOfJoin

2022-12-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18417:
---
Labels: pull-request-available  (was: )

> [C++] Support emit info in Substrait extension-multi and AsOfJoin
> -
>
> Key: ARROW-18417
> URL: https://issues.apache.org/jira/browse/ARROW-18417
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, Arrow-Substrait does not handle emit info that may appear in an 
> extension-multi relation in a Substrait plan. Besides the generic handling in 
> the Arrow-Substrait extension API, specific handling for AsOfJoin is 
> required, because AsOfJoinNode produces an output schema that is different 
> from the one used in the emit info. In particular, the AsOfJoinNode output 
> schema does not include the on- and by-keys of the right tables.
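Substrait emit info is essentially a projection by output-column index, so applying it to a schema is just reordering. A minimal sketch of the semantics (the function is illustrative):

```python
def apply_emit(fields, emit):
    # Substrait 'emit' lists output column indices; the relation's output
    # keeps only those columns, in the listed order. The AsOfJoin wrinkle is
    # that the node's own output schema drops the right tables' on-/by-keys,
    # so indices must be remapped before emit can be applied.
    return [fields[i] for i in emit]
```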





[jira] [Updated] (ARROW-18112) [Go] Remaining Scalar Unary Arithmetic (sin/cos/etc. rounding, log/ln, etc.)

2022-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18112:
---
Labels: pull-request-available  (was: )

> [Go] Remaining Scalar Unary Arithmetic (sin/cos/etc. rounding, log/ln, etc.)
> 
>
> Key: ARROW-18112
> URL: https://issues.apache.org/jira/browse/ARROW-18112
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18412) [R] Windows build fails because of missing ChunkResolver symbols

2022-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18412:
---
Labels: pull-request-available  (was: )

> [R] Windows build fails because of missing ChunkResolver symbols
> 
>
> Key: ARROW-18412
> URL: https://issues.apache.org/jira/browse/ARROW-18412
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In recent nightly builds of the Windows package we have a build failure 
> because some symbols related to the {{ChunkResolver}} are not found in the 
> linking stage.
> https://github.com/ursacomputing/crossbow/actions/runs/3559717769/jobs/5979255297#step:9:2818
> [~kou] suggested the following patch might fix the build: 
> https://github.com/apache/arrow/pull/14530#issuecomment-1328341447





[jira] [Updated] (ARROW-18416) [R] Update NEWS for 10.0.1

2022-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18416:
---
Labels: pull-request-available  (was: )

> [R] Update NEWS for 10.0.1
> --
>
> Key: ARROW-18416
> URL: https://issues.apache.org/jira/browse/ARROW-18416
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18402) [C++] Expose `DeclarationInfo`

2022-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18402:
---
Labels: pull-request-available  (was: )

> [C++] Expose `DeclarationInfo`
> --
>
> Key: ARROW-18402
> URL: https://issues.apache.org/jira/browse/ARROW-18402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> `DeclarationInfo` is just a pair of `Declaration` and `Schema`, which are 
> public APIs, and so can be made a public API itself. This can be part of or a 
> follow-up on [https://github.com/apache/arrow/pull/14485], and will allow 
> implementing extension providers, whose API depends on `DeclarationInfo`, 
> outside of the Arrow repo.





[jira] [Updated] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18123:
---
Labels: pull-request-available  (was: )

> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: Miles Granger
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}





[jira] [Updated] (ARROW-18119) [C++] Utility method to ensure an array object meets an alignment requirement

2022-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18119:
---
Labels: pull-request-available  (was: )

> [C++] Utility method to ensure an array object meets an alignment 
> requirement
> 
>
> Key: ARROW-18119
> URL: https://issues.apache.org/jira/browse/ARROW-18119
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Sanjiban Sengupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This would look something like:
> EnsureAligned(Buffer|Array|ChunkedArray|RecordBatch|Table, int 
> minimum_alignment, MemoryPool* memory_pool);
> It would fail if the MemoryPool's alignment < minimum_alignment.
> It would iterate through each buffer of the object; if a buffer is not 
> aligned properly, it would reallocate and copy that buffer (using 
> memory_pool).
> It would return a new object where every buffer is guaranteed to meet the 
> alignment requirement.
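The per-buffer check at the core of such a utility can be sketched in Python (illustrative only; `buffer_address` and `is_aligned` are hypothetical helpers, not Arrow APIs):

```python
import ctypes


def buffer_address(buf, offset=0):
    # Address of the first byte of a writable buffer, plus an optional offset.
    return ctypes.addressof(ctypes.c_char.from_buffer(buf, offset))


def is_aligned(address, alignment):
    # The predicate EnsureAligned would apply to each buffer of the object:
    # a buffer is acceptable iff its base address is a multiple of the
    # required alignment.
    return address % alignment == 0
```

Buffers that fail this predicate would be reallocated from the given memory pool and copied; buffers that pass could be reused as-is in the returned object.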





[jira] [Updated] (ARROW-18413) [C++][Parquet] FileMetaData exposes page index metadata

2022-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18413:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] FileMetaData exposes page index metadata
> ---
>
> Key: ARROW-18413
> URL: https://issues.apache.org/jira/browse/ARROW-18413
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Parquet
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Parquet ColumnChunk thrift object has recorded metadata for page index:
> [parquet-format/parquet.thrift at master · apache/parquet-format 
> (github.com)|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L799]
> We just need to add a public API to ColumnChunkMetaData to make it ready to 
> read.





[jira] [Updated] (ARROW-18395) [C++] Move select-k implementation into separate module

2022-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18395:
---
Labels: good-second-issue pull-request-available  (was: good-second-issue)

> [C++] Move select-k implementation into separate module
> ---
>
> Key: ARROW-18395
> URL: https://issues.apache.org/jira/browse/ARROW-18395
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Ben Harkins
>Priority: Minor
>  Labels: good-second-issue, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The select-k kernel implementations are currently in {{vector_sort.cc}}, 
> amongst other things.
> To make the code more readable and faster to compile, we should move them 
> into their own file.





[jira] [Updated] (ARROW-18316) [CI] Migrate jobs on Travis CI to dev/tasks/

2022-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18316:
---
Labels: pull-request-available  (was: )

> [CI] Migrate jobs on Travis CI to dev/tasks/
> 
>
> Key: ARROW-18316
> URL: https://issues.apache.org/jira/browse/ARROW-18316
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://cwiki.apache.org/confluence/display/INFRA/Travis+Migrations
> {quote}
> On November 2nd, 2020, Travis-CI announced the end of unlimited support for 
> open source projects.
> Infra is therefore moving our CI offerings away from Travis-CI in order to 
> keep our builds pipeline  cost-effective
> Deadline: December 31st 2022
> Infrastructure is moving ASF projects away from using Travis-CI at the end of 
> 2022.
> {quote}
> We need to migrate jobs in {{.travis.yml}} to {{dev/tasks/}}. Jobs in 
> {{dev/tasks/}} are triggered by Crossbow. The Crossbow used in apache/arrow ( 
> https://github.com/ursacomputing/crossbow/ ) can still use Travis CI ( 
> https://app.travis-ci.com/github/ursacomputing/crossbow ) because the Travis 
> CI account is sponsored by Voltron Data, not ASF.





[jira] [Updated] (ARROW-18280) [C++][Python] Support slicing to arbitrary end in list_slice kernel

2022-11-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18280:
---
Labels: pull-request-available  (was: )

> [C++][Python] Support slicing to arbitrary end in list_slice kernel
> ---
>
> Key: ARROW-18280
> URL: https://issues.apache.org/jira/browse/ARROW-18280
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [GitHub PR | https://github.com/apache/arrow/pull/14395] adds support for 
> the {{list_slice}} kernel, but does not implement what to do when {{stop == 
> std::nullopt}}, which should slice to the end of the list elements.
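The requested semantics can be sketched in Python, where a missing stop maps naturally onto slicing to the end of each list element (a sketch of the desired behaviour, not the actual Arrow kernel):

```python
def list_slice(lists, start, stop=None, step=1):
    # stop=None plays the role of stop == std::nullopt: slice each list
    # element from start to its end. Null (None) elements propagate unchanged.
    return [None if item is None else item[start:stop:step] for item in lists]
```

With an explicit stop the behaviour is the ordinary bounded slice; only the stop=None case is the new ground covered by this ticket.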





[jira] [Updated] (ARROW-10158) [C++] Add support for Parquet with Page Index

2022-11-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10158:
---
Labels: pull-request-available  (was: )

> [C++] Add support for Parquet with Page Index
> -
>
> Key: ARROW-10158
> URL: https://issues.apache.org/jira/browse/ARROW-10158
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Malthe Borch
>Assignee: Gang Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A recent parquet-format release also added support for a [Page Index 
> |https://github.com/apache/parquet-format/blob/96a8f3172a3b895408d2d1b939200dd02ab8300d/PageIndex.md]
>  making it possible to skip pages within a RowGroup.
> This should be implemented by Apache Arrow.





[jira] [Updated] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-11-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18106:
---
Labels: json pull-request-available  (was: json)

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a string column as the result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be that this is specific to timestamps; I don't directly see a 
> similar issue with e.g. {{"column": "A"}} and setting the schema to "column" 
> being int64.





[jira] [Updated] (ARROW-18410) [Packaging][Ubuntu] Add support for Ubuntu 22.10

2022-11-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18410:
---
Labels: pull-request-available  (was: )

> [Packaging][Ubuntu] Add support for Ubuntu 22.10
> 
>
> Key: ARROW-18410
> URL: https://issues.apache.org/jira/browse/ARROW-18410
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18409) [GLib][Plasma] Suppress deprecated warning in building plasma-glib

2022-11-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18409:
---
Labels: pull-request-available  (was: )

> [GLib][Plasma] Suppress deprecated warning in building plasma-glib
> --
>
> Key: ARROW-18409
> URL: https://issues.apache.org/jira/browse/ARROW-18409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If we always get the "Plasma is deprecated since Arrow 10.0.0. ..." warning 
> from {{plasma/common.h}}, we can't use the {{-Dwerror=true}} Meson option 
> with plasma-glib.





[jira] [Updated] (ARROW-18405) [Ruby] Raw table converter rebuilds chunked arrays

2022-11-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18405:
---
Labels: pull-request-available  (was: )

> [Ruby] Raw table converter rebuilds chunked arrays
> --
>
> Key: ARROW-18405
> URL: https://issues.apache.org/jira/browse/ARROW-18405
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 10.0.0
>Reporter: Sten Larsson
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider the following Ruby script:
> {code:ruby}
> require 'arrow'
> data = Arrow::ChunkedArray.new([Arrow::Int64Array.new([1])])
> table = Arrow::Table.new('column' => data)
> puts table['column'].data_type
> {code}
> This prints "int64" with red-arrow 9.0.0 and "uint8" in 10.0.0.
> From my understanding it is due to this commit: 
> [https://github.com/apache/arrow/commit/913d9c0a9a1a4398ed5f56d713d586770b4f702c#diff-f7f19bbc3945ea30ba06d851705f2d58f7666507bb101c4e151014ca398bd635R42]
> The old version would not call ArrayBuilder.build on a ChunkedArray, but the 
> new version does. This is a problem for us, because we need the column to 
> stay int64.
> A workaround is to specify a schema and list of arrays instead to bypass the 
> raw table converter:
> {code:ruby}
> table = Arrow::Table.new([{name: 'column', type: 'int64'}], [data])
> {code}





[jira] [Updated] (ARROW-18407) [Release][Website] Use UTC for release date

2022-11-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18407:
---
Labels: pull-request-available  (was: )

> [Release][Website] Use UTC for release date
> ---
>
> Key: ARROW-18407
> URL: https://issues.apache.org/jira/browse/ARROW-18407
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Website
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18406) [C++] Can't build Arrow with Substrait on Ubuntu 20.04

2022-11-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18406:
---
Labels: pull-request-available  (was: )

> [C++] Can't build Arrow with Substrait on Ubuntu 20.04
> --
>
> Key: ARROW-18406
> URL: https://issues.apache.org/jira/browse/ARROW-18406
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I recently tried to rebuild Arrow with Substrait on Ubuntu 20.04 and got the 
> following error:
> {code:java}
> [100%] Building CXX object 
> src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/type_internal.cc.o
> /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:
>  In function ‘arrow::Status arrow::engine::DecodeArg(const 
> substrait::FunctionArgument&, int, arrow::engine::SubstraitCall*, const 
> arrow::engine::ExtensionSet&, const arrow::engine::ConversionOptions&)’:
> /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:60:21:
>  error: ‘bool substrait::FunctionArgument::has_enum_() const’ is private 
> within this context
>60 |   if (arg.has_enum_()) {
>   | ^
> In file included from 
> /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.h:30,
>  from 
> /home/dewey/Desktop/r/arrow/cpp/src/arrow/engine/substrait/expression_internal.cc:20:
> /home/dewey/.r-arrow-dev-build/build/substrait_ep-generated/substrait/algebra.pb.h:21690:13:
>  note: declared private here
> 21690 | inline bool FunctionArgument::has_enum_() const {
>   | ^~~~
> [100%] Building CXX object 
> src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/util.cc.o
> make[2]: *** 
> [src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/build.make:76: 
> src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/substrait/expression_internal.cc.o]
>  Error 1
> make[2]: *** Waiting for unfinished jobs
> make[1]: *** [CMakeFiles/Makefile2:2028: 
> src/arrow/engine/CMakeFiles/arrow_substrait_objlib.dir/all] Error 2
> make: *** [Makefile:146: all] Error 2
> {code}
> [~westonpace] suggested that it is probably a protobuf version problem! For 
> me this is:
> {code:java}
> $ protoc --version
> libprotoc 3.6.1
> {code}





[jira] [Updated] (ARROW-18380) MIGRATION: Enable bot handling of GitHub issue linked PRs

2022-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18380:
---
Labels: pull-request-available  (was: )

> MIGRATION: Enable bot handling of GitHub issue linked PRs
> -
>
> Key: ARROW-18380
> URL: https://issues.apache.org/jira/browse/ARROW-18380
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Developer Tools
>Reporter: Todd Farmer
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> GitHub workflows for the Apache Arrow project assume that PRs reference ASF 
> Jira issues (or are minor changes). This needs to be revisited now that 
> GitHub issue reporting is enabled, as there may well be no ASF Jira issue to 
> link a PR against going forward. The resulting bot comments can be confusing.





[jira] [Updated] (ARROW-18399) [Python] Reduce warnings during tests

2022-11-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18399:
---
Labels: pull-request-available  (was: )

> [Python] Reduce warnings during tests
> -
>
> Key: ARROW-18399
> URL: https://issues.apache.org/jira/browse/ARROW-18399
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Numerous warnings are displayed at the end of a test run; we should strive 
> to reduce them:
> https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489





[jira] [Updated] (ARROW-18113) Implement a read range process without caching

2022-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18113:
---
Labels: pull-request-available  (was: )

> Implement a read range process without caching
> --
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Percy Camilo Triveño Aucahuasi
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current 
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
>  is mixing caching with coalescing, making it difficult to implement readers 
> capable of truly performing concurrent reads on coalesced data (see this 
> [github 
> comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979] for 
> additional context); for instance, right now the prebuffering feature of 
> those readers cannot handle concurrent invocations.
> The goal for this ticket is to implement a component similar to 
> ReadRangeCache that performs non-cached reads (doing only the coalescing part 
> instead).  So, once we have that new capability, we can port the parquet and 
> IPC readers to this new component and keep improving the reading process 
> (that would be part of another set of follow-up tickets).  Similar ideas were 
> mentioned here: https://issues.apache.org/jira/browse/ARROW-17599
> Maybe a good place to implement this new capability is inside the file system 
> abstraction (as part of a dedicated method to read coalesced data), where the 
> abstract file system can provide a default implementation.
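The coalescing half, taken on its own without any caching, can be sketched as a pure function over (offset, length) ranges. This is illustrative only: the `hole_size_limit` name mirrors the knob on Arrow's existing CacheOptions, but the function itself is not an Arrow API.

```python
def coalesce_ranges(ranges, hole_size_limit):
    # Merge (offset, length) read ranges whose gaps are no larger than
    # hole_size_limit bytes, so each merged range can be served by one
    # larger I/O call instead of many small ones.
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            if offset - (prev_offset + prev_length) <= hole_size_limit:
                # Gap is small enough: extend the previous range.
                new_end = max(prev_offset + prev_length, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged
```

A non-caching reader could run such a pass over the requested ranges and then issue the merged reads concurrently, with no shared mutable cache to serialize invocations.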





[jira] [Updated] (ARROW-18397) [C++] Clear S3 region resolver client at S3 shutdown

2022-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18397:
---
Labels: pull-request-available  (was: )

> [C++] Clear S3 region resolver client at S3 shutdown
> 
>
> Key: ARROW-18397
> URL: https://issues.apache.org/jira/browse/ARROW-18397
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.2, 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The S3 region resolver caches an S3 client at module scope. This client can be 
> destroyed very late and trigger an assertion error in the AWS SDK because the 
> SDK was already shut down:
> https://github.com/aws/aws-sdk-cpp/issues/2204
> When explicitly finalizing S3, we should ensure we also destroy the cached S3 
> client.





[jira] [Updated] (ARROW-18272) [pyarrow] ParquetFile does not recognize GCS cloud path as a string

2022-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18272:
---
Labels: pull-request-available  (was: )

> [pyarrow] ParquetFile does not recognize GCS cloud path as a string
> ---
>
> Key: ARROW-18272
> URL: https://issues.apache.org/jira/browse/ARROW-18272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
>Reporter: Zepu Zhang
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have a Parquet file at
>  
> path = 'gs://mybucket/abc/d.parquet'
>  
> `pyarrow.parquet.read_metadata(path)` works fine.
>  
> `pyarrow.parquet.ParquetFile(path)` raises "Failed to open local file 
> 'gs://mybucket/abc/d.parquet'".
>  
> Looks like ParquetFile misses the path resolution logic found in 
> `read_metadata`.





[jira] [Updated] (ARROW-18392) [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket

2022-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18392:
---
Labels: Nightly pull-request-available  (was: Nightly)

> [CI][Python] Some nightly python tests fail due to ACCESS DENIED to S3 bucket 
> --
>
> Key: ARROW-18392
> URL: https://issues.apache.org/jira/browse/ARROW-18392
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Miles Granger
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Several nightly tests fail with:
> {code:java}
>  === FAILURES 
> ===
>  test_s3fs_wrong_region 
>     @pytest.mark.s3
>     def test_s3fs_wrong_region():
>         from pyarrow.fs import S3FileSystem
>     
>         # wrong region for bucket
>         fs = S3FileSystem(region='eu-north-1')
>     
>         msg = ("When getting information for bucket 
> 'voltrondata-labs-datasets': "
>                r"AWS Error UNKNOWN \(HTTP status 301\) during HeadBucket "
>                "operation: No response body. Looks like the configured region 
> is "
>                "'eu-north-1' while the bucket is located in 'us-east-2'."
>                "|NETWORK_CONNECTION")
>         with pytest.raises(OSError, match=msg) as exc:
>             fs.get_file_info("voltrondata-labs-datasets")
>     
>         # Sometimes fails on unrelated network error, so next call would also 
> fail.
>         if 'NETWORK_CONNECTION' in str(exc.value):
>             return
>     
>         fs = S3FileSystem(region='us-east-2')
> >       
> > fs.get_file_info("voltrondata-labs-datasets")opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_fs.py:1339:
> >  
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/_fs.pyx:571: in pyarrow._fs.FileSystem.get_file_info
>     ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
>     ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ >   ???
> E   OSError: When getting information for bucket 'voltrondata-labs-datasets': 
> AWS Error ACCESS_DENIED during HeadBucket operation: No response body. {code}
> I can't seem to be able to reproduce locally but is pretty consistent:
>  * 
> [test-conda-python-3.10|https://github.com/ursacomputing/crossbow/actions/runs/3528202639/jobs/5918051269]
>  * 
> [test-conda-python-3.11|https://github.com/ursacomputing/crossbow/actions/runs/3528201175/jobs/5918048135]
>  * 
> [test-conda-python-3.7|https://github.com/ursacomputing/crossbow/actions/runs/3528195566/jobs/5918035812]
>  * 
> [test-conda-python-3.7-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528211334/jobs/5918069152]
>  * 
> [test-conda-python-3.8|https://github.com/ursacomputing/crossbow/actions/runs/3528193702/jobs/5918032370]
>  * 
> [test-conda-python-3.8-pandas-latest|https://github.com/ursacomputing/crossbow/actions/runs/3528213536/jobs/5918073481]
>  * 
> [test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3528205157/jobs/5918056277]
>  * 
> [test-conda-python-3.9|https://github.com/ursacomputing/crossbow/actions/runs/3528202402/jobs/5918050613]
>  * 
> [test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3528210560/jobs/5918067302]





[jira] [Updated] (ARROW-18390) [CI][Python] Nightly python test for spark master missing test module

2022-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18390:
---
Labels: Nightly pull-request-available  (was: Nightly)

> [CI][Python] Nightly python test for spark master missing test module
> -
>
> Key: ARROW-18390
> URL: https://issues.apache.org/jira/browse/ARROW-18390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: Nightly, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the nightly test with spark master 
> [test-conda-python-3.9-spark-master|https://github.com/ursacomputing/crossbow/actions/runs/3528196313/jobs/5918037939]
>  fails with:
> {code:java}
> Starting test(python): pyspark.sql.tests.test_pandas_map (temp output: 
> /spark/python/target/cbca1b18-4af7-4205-aa41-8c945bf1cf58/python__pyspark.sql.tests.test_pandas_map__9ptzo8sa.log)
> /opt/conda/envs/arrow/bin/python: No module named 
> pyspark.sql.tests.test_pandas_grouped_map {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18389) [CI][Python] Update nightly test-conda-python-3.7-pandas-0.24 to pandas >= 1.0

2022-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18389:
---
Labels: pull-request-available  (was: )

> [CI][Python] Update nightly test-conda-python-3.7-pandas-0.24 to pandas >= 1.0
> --
>
> Key: ARROW-18389
> URL: https://issues.apache.org/jira/browse/ARROW-18389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/ARROW-18173 removed support for pandas 
> < 1.0, so we should upgrade the nightly test accordingly.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18384) [Release][MSYS2] Show pull request title

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18384:
---
Labels: pull-request-available  (was: )

> [Release][MSYS2] Show pull request title
> 
>
> Key: ARROW-18384
> URL: https://issues.apache.org/jira/browse/ARROW-18384
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18292) [Release][Python] Upload .wheel/.tar.gz for release not RC

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18292:
---
Labels: pull-request-available  (was: )

> [Release][Python] Upload .wheel/.tar.gz for release not RC
> --
>
> Key: ARROW-18292
> URL: https://issues.apache.org/jira/browse/ARROW-18292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{dev/release/post-09-python.sh}} uploads {{.wheel}}/{{.tar.gz}} to the RC 
> location ( https://apache.jfrog.io/ui/native/arrow/python-rc/ ) rather than 
> the release location ( https://apache.jfrog.io/ui/native/arrow/python/ ). The 
> contents are identical (because we copy the artifacts of the passed RC to the 
> release), but we should upload {{.wheel}}/{{.tar.gz}} to the release location 
> to clarify that we use vote-passed artifacts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18383) [C++] Avoid global variables for thread pools and at-fork handlers

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18383:
---
Labels: pull-request-available  (was: )

> [C++] Avoid global variables for thread pools and at-fork handlers
> --
>
> Key: ARROW-18383
> URL: https://issues.apache.org/jira/browse/ARROW-18383
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Investigation revealed an issue where the global IO thread pool could be 
> constructed before the at-fork handler internal state. The IO thread pool, 
> created on library load, would register an at-fork handler; then, the at-fork 
> handler state would be initialized and clobber the handler registered just 
> before.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15812:
---
Labels: pull-request-available  (was: )

> [R] Allow user to supply col_names argument when reading in a CSV dataset
> -
>
> Key: ARROW-15812
> URL: https://issues.apache.org/jira/browse/ARROW-15812
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Allow the user to supply the {{col_names}} argument from {{readr}} when 
> reading in a dataset.  
> This is already possible when reading in a single CSV file via 
> {{arrow::read_csv_arrow()}}, which uses the {{readr_to_csv_read_options}} 
> function. Once the C++ functionality to autogenerate column names for 
> Datasets is implemented, we should hook up {{readr_to_csv_read_options}} in 
> {{csv_file_format_read_opts}}, just as we have with 
> {{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18111) [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18111:
---
Labels: pull-request-available  (was: )

> [Go] Remaining Scalar Binary Arithmetic (bitwise, shifts)
> -
>
> Key: ARROW-18111
> URL: https://issues.apache.org/jira/browse/ARROW-18111
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18382) [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18382:
---
Labels: pull-request-available  (was: )

> [C++] "ADDRESS_SANITIZER" not defined in fuzzing builds
> ---
>
> Key: ARROW-18382
> URL: https://issues.apache.org/jira/browse/ARROW-18382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fuzzing builds (as run by OSS-Fuzz) enable Address Sanitizer through their 
> own set of options rather than by enabling {{ARROW_USE_ASAN}}. However, the 
> Arrow source code needs to be informed of this situation.
> One example of where this matters is that eternal thread pools produce 
> spurious leaks at shutdown because of the vector of at-fork handlers; it 
> therefore needs to be worked around on those builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18360) [Python] Incorrectly passing schema=None to do_put crashes

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18360:
---
Labels: pull-request-available  (was: )

> [Python] Incorrectly passing schema=None to do_put crashes
> --
>
> Key: ARROW-18360
> URL: https://issues.apache.org/jira/browse/ARROW-18360
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Bryan Cutler
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In pyarrow.flight, putting an incorrect value of None for schema in do_put 
> will lead to a core dump.
> In pyarrow 9.0.0, trying to enter the command leads to a segmentation fault:
> {code}
> In [3]: writer, reader = 
> client.do_put(flight.FlightDescriptor.for_command(cmd), schema=None)
> Segmentation fault (core dumped)
> {code}
> In pyarrow 7.0.0, the kernel crashes after attempting to access the writer 
> and I got the following:
> {code}
> In [38]: client = flight.FlightClient('grpc+tls://localhost:9643', 
> disable_server_verification=True)
> In [39]: writer, reader = 
> client.do_put(flight.FlightDescriptor.for_command(cmd), None)
> In [40]: 
> writer./home/conda/feedstock_root/build_artifacts/arrow-cpp-ext_1644752264449/work/cpp/src/arrow/flight/client.cc:736:
>   Check failed: (batch_writer_) != (nullptr) 
> miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(+0x66288c)[0x7f0feeae088c]
> miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0x101)[0x7f0feeae0c91]
> miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.700(+0x7c1e1)[0x7f0fa9e331e1]
> miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so(+0x17cf1a)[0x7f0fefe7ff1a]
> miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
> miniconda3/envs/dev/bin/python(+0x144814)[0x559a7cb8f814]
> miniconda3/envs/dev/bin/python(+0x1445bf)[0x559a7cb8f5bf]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
> miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac]
> miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
> miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
> miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44]
> miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397]
> miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
> miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
> miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
> miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac]
> miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
> miniconda3/envs/dev/bin/python(+0x151ef3)[0x559a7cb9cef3]
> miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44]
> miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x1311)[0x559a7cb7fbd1]
> miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
> miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
> miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x66f)[0x559a7cb7ef2f]
> miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
> miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
> miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
> miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
> miniconda3/envs/dev/bin/python(+0x1416f5)[0x559a7cb8c6f5]
> miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x52)[0x559a7cb8c4a2]
> miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
> miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
> miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
> miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494]
> 

[jira] [Updated] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18265:
---
Labels: pull-request-available  (was: )

> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but it does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18282) [C++][Python] Support step slicing in list_slice kernel

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18282:
---
Labels: C++ Python pull-request-available  (was: C++ Python)

> [C++][Python] Support step slicing in list_slice kernel
> ---
>
> Key: ARROW-18282
> URL: https://issues.apache.org/jira/browse/ARROW-18282
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: C++, Python, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [GitHub PR 14395|https://github.com/apache/arrow/pull/14395] adds the 
> {{list_slice}} kernel, but does not implement the case where {{step != 1}}; 
> this issue covers supporting step values other than 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18379) [Python] Change warnings to _warnings in _plasma_store_entry_point

2022-11-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18379:
---
Labels: pull-request-available  (was: )

> [Python] Change warnings to _warnings in _plasma_store_entry_point
> --
>
> Key: ARROW-18379
> URL: https://issues.apache.org/jira/browse/ARROW-18379
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.2, 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a leftover in {{python/pyarrow/__init__.py}} from 
> [https://github.com/apache/arrow/pull/14343] due to {{warnings}} being 
> imported as {{_warnings}}.
> Connected GitHub issue: [https://github.com/apache/arrow/issues/14693]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18303) [Go] Missing tag for compute module

2022-11-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18303:
---
Labels: pull-request-available  (was: )

> [Go] Missing tag for compute module
> ---
>
> Key: ARROW-18303
> URL: https://issues.apache.org/jira/browse/ARROW-18303
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Affects Versions: 10.0.0
>Reporter: Lilian Maurel
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Time Spent: 10m
>  Remaining Estimate: 50m
>
> Since https://issues.apache.org/jira/browse/ARROW-17456, compute has been 
> split into a separate module.
>  
> Imports change from github.com/apache/arrow/go/v9/arrow/compute to 
> github.com/apache/arrow/go/arrow/compute/v10.
>  
> The tag go/arrow/compute/v10.0.0 must be created for go mod resolution.
>  
> Also, in go.mod the line
> module github.com/apache/arrow/go/v10/arrow/compute
> must be changed to module github.com/apache/arrow/go/arrow/compute/v10



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18374) [Go][CI][Benchmarks] Fix Go Bench Script after conbench change

2022-11-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18374:
---
Labels: pull-request-available  (was: )

> [Go][CI][Benchmarks] Fix Go Bench Script after conbench change
> --
>
> Key: ARROW-18374
> URL: https://issues.apache.org/jira/browse/ARROW-18374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, Continuous Integration, Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The change [https://github.com/conbench/conbench/pull/417/files#] now 
> requires passing an explicit {{github=None}} argument to {{BenchmarkResult}} 
> to have it get the GitHub info from the locally cloned repo.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18373) MIGRATION: Enable multiple component selection in issue templates

2022-11-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18373:
---
Labels: pull-request-available  (was: )

> MIGRATION: Enable multiple component selection in issue templates
> -
>
> Key: ARROW-18373
> URL: https://issues.apache.org/jira/browse/ARROW-18373
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per comments in [this merged PR|https://github.com/apache/arrow/pull/14675], 
> we would like to enable selection of multiple components when reporting 
> issues via GitHub issues.
> Additionally, we may want to add the needed Apache license to the issue 
> templates and remove the exclusion rules from rat_exclude_files.txt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18363:
---
Labels: pull-request-available  (was: )

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that we have versioned docs, we also host the old versions of the 
> developer docs (e.g. 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (e.g. regarding communication channels, build instructions, 
> etc.), and when contributing to or developing against the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs, 
> similar to how some projects warn about viewing old docs in general and point 
> to the stable docs (e.g. https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html). In this case we could show a 
> custom box on pages under /developers that points to the dev docs instead of 
> the stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17676) [C++] [Python] User-defined tabular functions

2022-11-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17676:
---
Labels: pull-request-available  (was: )

> [C++] [Python] User-defined tabular functions
> -
>
> Key: ARROW-17676
> URL: https://issues.apache.org/jira/browse/ARROW-17676
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, only stateless user-defined functions are supported in PyArrow. 
> This issue will add support for a user-defined tabular function, which is a 
> user-defined function implemented in Python that returns a stateful stream of 
> tabular data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18366) [Packaging][RPM][Gandiva] Failed to link on AlmaLinux 9

2022-11-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18366:
---
Labels: pull-request-available  (was: )

> [Packaging][RPM][Gandiva] Failed to link on AlmaLinux 9 
> 
>
> Key: ARROW-18366
> URL: https://issues.apache.org/jira/browse/ARROW-18366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/ursacomputing/crossbow/actions/runs/3502784911/jobs/5867407921#step:6:4748
> {noformat}
> FAILED: gandiva-glib/Gandiva-1.0.gir 
> env 
> PKG_CONFIG_PATH=/usr/lib64/pkgconfig:/usr/share/pkgconfig:/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/meson-uninstalled
>  /usr/bin/g-ir-scanner --quiet --no-libtool --namespace=Gandiva 
> --nsversion=1.0 --warn-all --output gandiva-glib/Gandiva-1.0.gir 
> --c-include=gandiva-glib/gandiva-glib.h --warn-all 
> --include-uninstalled=./arrow-glib/Arrow-1.0.gir 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/gandiva-glib 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src
>  
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src
>  -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src
>  
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src
>  -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src 
> --filelist=/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib/libgandiva-glib.so.1100.0.0.p/Gandiva_1.0_gir_filelist
>  --include=Arrow-1.0 --symbol-prefix=ggandiva --identifier-prefix=GGandiva 
> --pkg-export=gandiva-glib --cflags-begin 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/. 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/. 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/redhat-linux-build/src
>  
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/redhat-linux-build/src
>  -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/../cpp/src 
> -I/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../cpp/src 
> -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include 
> -I/usr/include/sysprof-4 -I/usr/include/gobject-introspection-1.0 
> --cflags-end 
> --add-include-path=/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/arrow-glib
>  --add-include-path=/usr/share/gir-1.0 
> -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/gandiva-glib 
> --library gandiva-glib 
> -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/arrow-glib 
> -L/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../../cpp/redhat-linux-build/release
>  --extra-library=gobject-2.0 --extra-library=glib-2.0 
> --extra-library=girepository-1.0 --sources-top-dirs 
> /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/ --sources-top-dirs 
> /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/ --warn-error
> /usr/bin/ld: 
> /build/rpmbuild/BUILD/apache-arrow-11.0.0.dev130/c_glib/build/../../cpp/redhat-linux-build/release/libgandiva.so.1100:
>  undefined reference to `std::__glibcxx_assert_fail(char const*, int, char 
> const*, char const*)'
> collect2: error: ld returned 1 exit status
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15470) [R] Allows user to specify string to be used for missing data when writing CSV dataset

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15470:
---
Labels: pull-request-available  (was: )

> [R] Allows user to specify string to be used for missing data when writing 
> CSV dataset
> --
>
> Key: ARROW-15470
> URL: https://issues.apache.org/jira/browse/ARROW-15470
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Will Jones
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ability to select the string used for missing data was implemented for 
> the CSV writer in ARROW-14903 and, as David Li points out below, is 
> available, so I think we just need to hook it up on the R side.
> This requires the values passed in as the "na" argument to be instead passed 
> through to "null_strings", similarly to what has been done with "skip" and 
> "skip_rows" in ARROW-15743.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18345) [R] Create a CRAN-specific packaging checklist that lives in the R package directory

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18345:
---
Labels: pull-request-available  (was: )

> [R] Create a  CRAN-specific packaging checklist that lives in the R package 
> directory
> -
>
> Key: ARROW-18345
> URL: https://issues.apache.org/jira/browse/ARROW-18345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dewey Dunnington
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Like other packaging tasks, the CRAN packaging task (which is concerned with 
> making sure the R package from the Arrow release complies with CRAN policies) 
> for the R package is slightly different than the overall Arrow release task 
> for the R package. For example, we often push patch-patch releases if the 
> two-week window we get to "safely retain the package on CRAN" does not line 
> up with a release vote. [~npr] has heroically been doing this for a long 
> time, and while he has equally heroically volunteered to keep doing it, I am 
> hoping the process of codifying this somewhere in the R repo will help a wider 
> set of contributors understand the process (even if it was already documented 
> elsewhere!).
> [~stephhazlitt] and I use {{usethis::use_release_issue()}} to manage our 
> personal R package releases, and I'm wondering if creating a similar function 
> or markdown template would help here.
> I'm happy to start the process of putting a PR up for discussion!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18362) [Parquet][C++] Accelerate bit-packing decoding with AVX-512

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18362:
---
Labels: pull-request-available  (was: )

> [Parquet][C++] Accelerate bit-packing decoding with AVX-512
> ---
>
> Key: ARROW-18362
> URL: https://issues.apache.org/jira/browse/ARROW-18362
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: zhaoyaqi
>Assignee: zhaoyaqi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Accelerate Parquet bit-packing decoding with AVX-512 instructions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18323) MIGRATION TEST ISSUE #2

2022-11-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18323:
---
Labels: pull-request-available  (was: )

> MIGRATION TEST ISSUE #2
> ---
>
> Key: ARROW-18323
> URL: https://issues.apache.org/jira/browse/ARROW-18323
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue was created to help test migration-related process and tooling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18361) [CI][Conan] Merge upstream changes

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18361:
---
Labels: pull-request-available  (was: )

> [CI][Conan] Merge upstream changes
> --
>
> Key: ARROW-18361
> URL: https://issues.apache.org/jira/browse/ARROW-18361
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Updated: https://github.com/conan-io/conan-center-index/pull/14111



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18110) [Go] Scalar Comparisons

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18110:
---
Labels: pull-request-available  (was: )

> [Go] Scalar Comparisons
> ---
>
> Key: ARROW-18110
> URL: https://issues.apache.org/jira/browse/ARROW-18110
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18349) [CI][C++][Flight] Exercise UCX on CI

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18349:
---
Labels: pull-request-available  (was: )

> [CI][C++][Flight] Exercise UCX on CI
> 
>
> Key: ARROW-18349
> URL: https://issues.apache.org/jira/browse/ARROW-18349
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration, FlightRPC
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> UCX doesn't seem enabled on any CI configuration for now.
> We should have at least a nightly job with UCX enabled, for example one of 
> the Conda or Ubuntu builds.





[jira] [Updated] (ARROW-18350) [C++] Use std::to_chars instead of std::to_string

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18350:
---
Labels: pull-request-available  (was: )

> [C++] Use std::to_chars instead of std::to_string
> -
>
> Key: ARROW-18350
> URL: https://issues.apache.org/jira/browse/ARROW-18350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{std::to_chars}} is locale-independent unlike {{std::to_string}}; it may 
> also be faster in some cases.




