[jira] [Created] (ARROW-18433) Optimize aggregate functions to work with batches.

2022-12-10 Thread A. Coady (Jira)
A. Coady created ARROW-18433:


 Summary: Optimize aggregate functions to work with batches.
 Key: ARROW-18433
 URL: https://issues.apache.org/jira/browse/ARROW-18433
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Affects Versions: 10.0.1
Reporter: A. Coady


Most compute functions work with the dataset API without loading entire 
columns. Aggregate functions that are associative could work the same way: 
`min`, `max`, `any`, `all`, `sum`, `product`, and even `unique` and 
`value_counts`.

A couple of implementation ideas:
 * expand the dataset API to support expressions that return scalars
 * add a `BatchedArray` type, which is like a `ChunkedArray` but with lazy 
loading
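
The reason associativity matters can be sketched in plain Python: each batch is reduced on its own, and the partial results are folded together, so no full column is ever held in memory. This is only an illustration under stated assumptions, not Arrow code. `iter_batches` and `batched_aggregate` are hypothetical helpers, and lists stand in for record batches streamed from a dataset.

```python
from collections import Counter
from functools import reduce
from typing import Callable, Iterable, Iterator, List


def iter_batches(values: List[int], batch_size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for a lazy reader: yields one slice at a time."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def batched_aggregate(batches: Iterable, per_batch: Callable, combine: Callable):
    """Aggregate each batch, then fold the partial results pairwise.

    Correct only when `combine` is associative. Raises TypeError if
    `batches` is empty, because no initial value is supplied.
    """
    return reduce(combine, (per_batch(batch) for batch in batches))


data = [3, 1, 4, 1, 5, 9, 2, 6]

# sum: partial sums are added together
total = batched_aggregate(iter_batches(data, 3), sum, lambda a, b: a + b)

# min: the minimum of the per-batch minimums
smallest = batched_aggregate(iter_batches(data, 3), min, min)

# unique: per-batch sets merged by union
distinct = batched_aggregate(iter_batches(data, 3), set, lambda a, b: a | b)

# value_counts: per-batch counters merged by addition
counts = batched_aggregate(iter_batches(data, 3), Counter, lambda a, b: a + b)
```

The same fold would apply to `max`, `any`, `all`, and `product`; only the per-batch function and the combining function change.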



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18432) [Python] Array constructor doesn't support arrow scalars.

2022-12-10 Thread A. Coady (Jira)
A. Coady created ARROW-18432:


 Summary: [Python] Array constructor doesn't support arrow scalars.
 Key: ARROW-18432
 URL: https://issues.apache.org/jira/browse/ARROW-18432
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 10.0.1
Reporter: A. Coady


{code:python}
pa.array([pa.scalar(0)])
ArrowInvalid: Could not convert <pyarrow.Int64Scalar: 0> with type pyarrow.lib.Int64Scalar: did not recognize Python value type when inferring an Arrow data type

pa.array([pa.scalar(0)], 'int64')
ArrowInvalid: Could not convert <pyarrow.Int64Scalar: 0> with type pyarrow.lib.Int64Scalar: tried to convert to int64
{code}
It seems odd that the array constructors don't recognize their own scalars.

In practice, a list of scalars has to be converted with `.as_py()` just to be 
converted back, and that also loses the type information.

 





[GitHub] [arrow] abduazizR opened a new issue, #14907: [R] right_join() function does not produce the expected outcome

2022-12-10 Thread GitBox


abduazizR opened a new issue, #14907:
URL: https://github.com/apache/arrow/issues/14907

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   I noticed something strange today while using Arrow datasets. I cannot 
give a reproducible example, but the code below shows the idea. `ccaei` is an 
Arrow dataset. When I call `right_join()` with an R tibble before `collect()`, 
the result is wrong: the number of distinct `ENROLID` values is lower than the 
number in `outpatients`. When I call `right_join()` after `collect()`, I get 
the correct number, but that is computationally inefficient. Could you help me 
with this?
   
   This version gives an unexpected count:
   ```
   ccaei |>
     filter(ADMDATE >= as_date("2016-10-01")) |>
     filter(!is.na(ENROLID)) |>
     select(ENROLID, ADMDATE) |>
     right_join(outpatients) |>
     collect() |>
     count(ENROLID)
   ```
   
   
   This version gives the expected count:
   ```
   ccaei |>
     filter(ADMDATE >= as_date("2016-10-01")) |>
     filter(!is.na(ENROLID)) |>
     select(ENROLID, ADMDATE) |>
     collect() |>
     right_join(outpatients) |>
     count(ENROLID)
   ```
   
   Not sure where the mistake came from.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] lidavidm closed issue #14890: [Java] DictionaryEncoder may leak memory when exception thrown

2022-12-10 Thread GitBox


lidavidm closed issue #14890: [Java] DictionaryEncoder may leak memory when 
exception thrown
URL: https://github.com/apache/arrow/issues/14890





[GitHub] [arrow] lidavidm closed issue #14901: [Java] ListSubfieldEncoder and StructSubfieldEncoder can decode without DictionaryHashTable

2022-12-10 Thread GitBox


lidavidm closed issue #14901: [Java] ListSubfieldEncoder and 
StructSubfieldEncoder can decode without DictionaryHashTable
URL: https://github.com/apache/arrow/issues/14901





[GitHub] [arrow-julia] ericphanson closed issue #348: Install Registrator.jl github app

2022-12-10 Thread GitBox


ericphanson closed issue #348: Install Registrator.jl github app
URL: https://github.com/apache/arrow-julia/issues/348

