[jira] [Updated] (ARROW-18253) [C++][Parquet] Improve bounds checking on some inputs

2022-11-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18253:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Improve bounds checking on some inputs
> -
>
> Key: ARROW-18253
> URL: https://issues.apache.org/jira/browse/ARROW-18253
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some cases we don't check for a lower bound of 0, on some 
> non-performance-critical paths we only have DCHECKs, and in some (unlikely) 
> cases we cast from size_t to int32, which can overflow. Adding some safety 
> checks here would be useful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18253) [C++][Parquet] Improve bounds checking on some inputs

2022-11-04 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-18253:
---

 Summary: [C++][Parquet] Improve bounds checking on some inputs
 Key: ARROW-18253
 URL: https://issues.apache.org/jira/browse/ARROW-18253
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


In some cases we don't check for a lower bound of 0, on some 
non-performance-critical paths we only have DCHECKs, and in some (unlikely) 
cases we cast from size_t to int32, which can overflow. Adding some safety 
checks here would be useful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18248) [CI][Release] Use GitHub token to avoid API rate limit

2022-11-04 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18248.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14588
[https://github.com/apache/arrow/pull/14588]

> [CI][Release] Use GitHub token to avoid API rate limit
> --
>
> Key: ARROW-18248
> URL: https://issues.apache.org/jira/browse/ARROW-18248
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> e.g.:
> https://github.com/apache/arrow/actions/runs/3387695588/jobs/5628769268#step:7:25
> {noformat}
> Error: test_vote(SourceTest): OpenURI::HTTPError: 403 rate limit exceeded
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)

2022-11-04 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629225#comment-17629225
 ] 

Weston Pace edited comment on ARROW-17820 at 11/4/22 11:24 PM:
---

{quote}
Such an approach doesn't really fit our kernels / Acero, I think. One option 
could be to have a generic kernel to "map" another kernel on the list values. 
where you can pass the function name you want to apply, and a FunctionOptions 
object matching the kernel. Would something like this be possible technically?
{quote}

Yes, I think that should be possible for unary kernels.  Mapping a single 
kernel (as opposed to a single expression) might be a bit limiting, though 
maybe it isn't so bad.  For example, what if a user wants to do something like 
{{map(lambda f: f.upper() * 2, ["a", "b", "c"])}}?

Another thing is that it should be valid to use n-ary functions too provided 
the other arguments are scalars.  This discussion has come up in Substrait with 
respect to lambdas (https://github.com/substrait-io/substrait/issues/349).

Perhaps the "map function" for {{List}} could be an expression bound to a 
schema of "\{item: T\}" (e.g. so you could do {{field_ref(0)}} or 
{{field_ref("item")}}).

Though if the map function is an expression then a kernel would have to execute 
an entire expression which may or may not be doable (I've reached the limit of 
my imagination for a Friday :)


was (Author: westonpace):
{quote}
Such an approach doesn't really fit our kernels / Acero, I think. One option 
could be to have a generic kernel to "map" another kernel on the list values. 
where you can pass the function name you want to apply, and a FunctionOptions 
object matching the kernel. Would something like this be possible technically?
{quote}

Yes, I think that should be possible for unary kernels.  Though I think mapping 
a single kernel (as opposed to a single expression) might be a bit limiting, 
though maybe it isn't so bad.  For example, what if a user wants to do 
something like {{map(lambda f: f.upper() * 2, ["a", "b", "c"])}}

Another thing is that it should be valid to use n-ary functions too provided 
the other arguments are scalars.  This discussion has come up in Substrait with 
respect to lambdas (https://github.com/substrait-io/substrait/issues/349).

Perhaps the "map function" for {{List}} could be an expression bound to a 
schema of "{item: T}" (e.g. so you could do {{field_ref(0)}} or 
{{field_ref("item")}}).

Though if the map function is an expression then a kernel would have to execute 
an entire expression which may or may not be doable (I've reached the limit of 
my imagination for a Friday :)

> [C++] Implement arithmetic kernels on List(number)
> --
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)

2022-11-04 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629225#comment-17629225
 ] 

Weston Pace commented on ARROW-17820:
-

{quote}
Such an approach doesn't really fit our kernels / Acero, I think. One option 
could be to have a generic kernel to "map" another kernel on the list values. 
where you can pass the function name you want to apply, and a FunctionOptions 
object matching the kernel. Would something like this be possible technically?
{quote}

Yes, I think that should be possible for unary kernels.  Mapping a single 
kernel (as opposed to a single expression) might be a bit limiting, though 
maybe it isn't so bad.  For example, what if a user wants to do something like 
{{map(lambda f: f.upper() * 2, ["a", "b", "c"])}}?

Another thing is that it should be valid to use n-ary functions too provided 
the other arguments are scalars.  This discussion has come up in Substrait with 
respect to lambdas (https://github.com/substrait-io/substrait/issues/349).

Perhaps the "map function" for {{List}} could be an expression bound to a 
schema of "{item: T}" (e.g. so you could do {{field_ref(0)}} or 
{{field_ref("item")}}).

Though if the map function is an expression then a kernel would have to execute 
an entire expression which may or may not be doable (I've reached the limit of 
my imagination for a Friday :)

> [C++] Implement arithmetic kernels on List(number)
> --
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18207) [Ruby] RubyGems for 10.0.0 aren't updated yet

2022-11-04 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18207.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Published.

> [Ruby] RubyGems for 10.0.0 aren't updated yet
> -
>
> Key: ARROW-18207
> URL: https://issues.apache.org/jira/browse/ARROW-18207
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 10.0.0
>Reporter: Noah Horton
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 11.0.0
>
>
> 10.0.0 was just released, meaning that all install scripts that use the 
> 'latest' tag are getting it.
> Yet rubygems.org is still serving the 9.0.0 version a week after 10.0.0 
> was released.
> The build scripts need to start updating rubygems.org automatically, or guide 
> users to a bundler config like 
> {code:ruby}
> gem "red-arrow", github: "apache/arrow", glob: "ruby/red-arrow/*.gemspec", 
> require: "arrow", tag: 'apache-arrow-10.0.0'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18016) [CI] Add sccache to r jobs

2022-11-04 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18016.
--
Resolution: Fixed

Issue resolved by pull request 14570
[https://github.com/apache/arrow/pull/14570]

> [CI] Add sccache to r jobs
> --
>
> Key: ARROW-18016
> URL: https://issues.apache.org/jira/browse/ARROW-18016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Building on the work in ARROW-17021 we can now activate sccache on more 
> builds and save more time! To keep the PRs reviewable I have reduced this to 
> only R jobs and will open follow-ups for the other tasks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-11-04 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629192#comment-17629192
 ] 

Arthur Passos commented on ARROW-17459:
---

Hi [~willjones127]. I have implemented your suggestion of GetRecordBatchReader 
and, at first, things seemed to work as expected. Recently, though, an issue 
with the parquet data read this way was reported, and reverting to the 
ReadRowGroup solution seems to address it. This might be a misuse of the arrow 
library on my side, even though I have read the API docs and it looks correct.

 

My question is pretty much: should there be a difference in the output when 
using the two APIs?

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17487) [Python][Packaging] 3.11 wheels

2022-11-04 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-17487.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14499
[https://github.com/apache/arrow/pull/14499]

> [Python][Packaging] 3.11 wheels
> ---
>
> Key: ARROW-17487
> URL: https://issues.apache.org/jira/browse/ARROW-17487
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Packaging, Python
>Reporter: Saugat Pachhai
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 12h 50m
>  Remaining Estimate: 0h
>
> Hi.
> Could you please provide wheels with Python 3.11 support?
> Python 3.11.0-rc1 was released: 
> [https://www.python.org/downloads/release/python-3110rc1/], and from this 
> release onward, there won't be any ABI changes.
> > There will be no ABI changes from this point forward in the 3.11 series and 
> > the goal is that there will be as few code changes as possible.
> So, I think it should be safe to release wheels.
> Thank you. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)

2022-11-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17820:
--
Summary: [C++] Implement arithmetic kernels on List(number)  (was: 
Implement arithmetic kernels on List(number))

> [C++] Implement arithmetic kernels on List(number)
> --
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17820) Implement arithmetic kernels on List(number)

2022-11-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629186#comment-17629186
 ] 

Joris Van den Bossche commented on ARROW-17820:
---

It would be nice to have a way for all unary scalar kernels to be applied to 
list arrays (indeed, by applying them to the single child array of flat 
values). 

I think in SQL one could do this with a subquery with unnesting and aggregating 
again (eg 
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#creating_arrays_from_subqueries,
 although that example is actually not a unary kernel but a binary).

Such an approach doesn't really fit our kernels / Acero, I think. One option 
could be to have a generic kernel to "map" another kernel on the list values. 
Like

{code}
list_map_function(list_array, "kernel_name", FunctionOptions)
{code}

where you can pass the function name you want to apply, and a FunctionOptions 
object matching the kernel. Would something like this be possible technically?

Another option could be to directly register the list type for unary kernels? 
In many cases there might be no ambiguity that we expect the function to be 
applied to each value in the list, instead of to each list, for example 
{{round(list)}} or {{ascii_lower(list)}}.




> Implement arithmetic kernels on List(number)
> 
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17820) Implement arithmetic kernels on List(number)

2022-11-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17820:
--
Labels: kernel query-engine  (was: )

> Implement arithmetic kernels on List(number)
> 
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18252) Add Acero test to ensure names in the root relation of Substrait plans are retained

2022-11-04 Thread Bryce Mecum (Jira)
Bryce Mecum created ARROW-18252:
---

 Summary: Add Acero test to ensure names in the root relation of 
Substrait plans are retained
 Key: ARROW-18252
 URL: https://issues.apache.org/jira/browse/ARROW-18252
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Bryce Mecum


Currently, Acero retains the names in the root relation when it executes a 
Substrait plan but there isn't a unit test in place to prevent a future 
regression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-18246:
---
Component/s: Documentation

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is also rendered on the 
> documentation website.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones updated ARROW-18246:
---
Fix Version/s: 11.0.0

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is also rendered on the 
> documentation website.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17984) pq.read_table doesn't seem to be thread safe

2022-11-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629119#comment-17629119
 ] 

Antoine Pitrou commented on ARROW-17984:


[~marsupialtail] Can you post the backtrace for _all_ threads? (using "thread 
apply all bt" as suggested by [~westonpace]).

Since that will probably be quite long I suggest posting it on a site such as 
https://gist.github.com/ and posting the link here.

> pq.read_table doesn't seem to be thread safe
> 
>
> Key: ARROW-17984
> URL: https://issues.apache.org/jira/browse/ARROW-17984
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet
>Affects Versions: 9.0.0
>Reporter: Ziheng Wang
>Priority: Major
> Attachments: _usr_bin_python3.8.1000.crash
>
>
> Before PR: [https://github.com/apache/arrow/pull/13799] gets merged in 
> master, I am using multithreading to improve read bandwidth from S3. Even 
> after that PR gets merged, I probably will still try to use multithreading to 
> some extent.
> However pq.read_table from S3 doesn't seem to be thread safe. Seems like it 
> uses the new dataset reader under the hood. I cannot provide a reproduction, 
> not a stable one anyway. But this is roughly the script I have been using 
> ~~~
> def get_next_batch(self, mapper_id, pos=None):
>     def download(file):
>         return pq.read_table("s3://" + self.bucket + "/" + file,
>                              columns=self.columns, filters=self.filters)
>
>     executor = concurrent.futures.ThreadPoolExecutor(max_workers=self.workers)
>     futures = {executor.submit(download, file): file for file in my_files}
>     for future in concurrent.futures.as_completed(futures):
>         yield future.result()
> ~~~
> The errors all have to do with malloc segfaults, which makes me suspect the 
> connection object is being reused across different pq.read_table invocations 
> in different threads.
> ```
> (InputReaderNode pid=25001, ip=172.31.60.29) malloc_consolidate(): invalid 
> chunk size
> (InputReaderNode pid=25001, ip=172.31.60.29) *** SIGABRT received at 
> time=1665464922 on cpu 9 ***
> (InputReaderNode pid=25001, ip=172.31.60.29) PC: @     0x7f9a480a803b  
> (unknown)  raise
> (InputReaderNode pid=25001, ip=172.31.60.29)     @     0x7f9a480a80c0       
> 4160  (unknown)
> (InputReaderNode pid=25001, ip=172.31.60.29)     @     0x7f9a480fa32c  
> (unknown)  (unknown)
> ```
> Note, this multithreaded code is running inside a Ray actor process, but that 
> shouldn't be a problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629101#comment-17629101
 ] 

Lucas Mation commented on ARROW-18242:
--

ok, same error as before

 

arrow_table(x = '00001976') %>% mutate(y = dmy(x)) %>% collect()
# A tibble: 1 x 2
  x        y         
  <chr>    <date>    
1 00001976 1975-11-30

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as 
> date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`.
> An invalid date such as '00001976' is being parsed as a valid (and completely 
> unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
>           x         x2
> 1: 00001976 1975-11-30
> 2: 30111976 1976-11-30
> 3: 01011976 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629048#comment-17629048
 ] 

Nicola Crane commented on ARROW-18242:
--

Sorry, my bad, I accidentally used {{ymd()}} there instead of {{dmy()}}.  Mind 
giving it a go with {{dmy()}}?

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as 
> date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`.
> An invalid date such as '00001976' is being parsed as a valid (and completely 
> unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
>           x         x2
> 1: 00001976 1975-11-30
> 2: 30111976 1976-11-30
> 3: 01011976 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629045#comment-17629045
 ] 

Lucas Mation commented on ARROW-18242:
--

So the error only occurs when the data is read from the parquet file and then 
collected.

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as 
> date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`.
> An invalid date such as '00001976' is being parsed as a valid (and completely 
> unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
>           x         x2
> 1: 00001976 1975-11-30
> 2: 30111976 1976-11-30
> 3: 01011976 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629044#comment-17629044
 ] 

Lucas Mation commented on ARROW-18242:
--

Weirdly, this returns correct output:

 

arrow_table(x = '00001976') %>% 
+     mutate(y = ymd(x)) %>% 
+     collect()
# A tibble: 1 x 2
  x        y     
  <chr>    <date>
1 00001976 NA

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as 
> date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`.
> An invalid date such as '00001976' is being parsed as a valid (and completely 
> unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
>           x         x2
> 1: 00001976 1975-11-30
> 2: 30111976 1976-11-30
> 3: 01011976 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18246:
---
Labels: docs-impacting documentation pull-request-available  (was: 
docs-impacting documentation)

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is also rendered on the 
> documentation website.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  





[jira] [Commented] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629039#comment-17629039
 ] 

Will Jones commented on ARROW-18246:


Thanks for reporting. I have created an update fixing those and a couple of 
other issues in the docs.

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is also rendered on the 
> documentation website.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  





[jira] [Comment Edited] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629025#comment-17629025
 ] 

Nicola Crane edited comment on ARROW-18242 at 11/4/22 2:54 PM:
---

I can't test this example as I don't have a Windows machine available to me 
now, but I'm guessing we get the same problem using this tiny reprex:


{code:r}
arrow_table(x = '00001976') %>% 
  mutate(y = ymd(x)) %>% 
  collect()
{code}



was (Author: thisisnic):
I can't test this example as I don't have a Windows machine available to me 
now, but I'm guessing we get the same problem using this tiny reprex: 
{{arrow_table(x = '00001976') %>% mutate(y = ymd(x)) %>% collect()}}

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" 
> as date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`:
> an invalid date such as '00001976' is being parsed as a valid (and 
> completely unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
> x
> 1: 1975-11-30
> 2: 1976-11-30
> 3: 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  
>  





[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629025#comment-17629025
 ] 

Nicola Crane commented on ARROW-18242:
--

I can't test this example as I don't have a Windows machine available to me 
now, but I'm guessing we get the same problem using this tiny reprex: 
{{arrow_table(x = '00001976') %>% mutate(y = ymd(x)) %>% collect()}}

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" 
> as date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`:
> an invalid date such as '00001976' is being parsed as a valid (and 
> completely unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
> x
> 1: 1975-11-30
> 2: 1976-11-30
> 3: 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  
>  





[jira] [Updated] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18242:
-
Component/s: R

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" 
> as date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`:
> an invalid date such as '00001976' is being parsed as a valid (and 
> completely unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
> x
> 1: 1975-11-30
> 2: 1976-11-30
> 3: 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  
>  





[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629023#comment-17629023
 ] 

Nicola Crane commented on ARROW-18242:
--

OK, my best guess as to what is going on here is that the original lubridate 
implementation uses a custom C parser to process these datetimes, whereas in 
the Arrow implementation some of this work is delegated to whichever external 
library is being depended on for datetimes, which is why there's a difference 
between Windows and Linux.  We might be able to add some additional 
pre-processing steps to our bindings (or the regex the setup code for them 
produces) to prevent this.
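The pre-processing idea above can be illustrated with a minimal sketch of strict ddmmyyyy validation. This is Python rather than the R bindings or Arrow's C++ kernels, and the function name `dmy_strict` is purely illustrative; it only shows the kind of check that would reject strings like '00001976':

```python
from datetime import datetime


def dmy_strict(s):
    """Parse a ddmmyyyy string strictly; return None (i.e. NA) when invalid.

    Illustrative only: this is not lubridate's or Arrow's actual parser,
    just the kind of validation step discussed above.
    """
    try:
        return datetime.strptime(s, "%d%m%Y").date()
    except ValueError:
        # Covers '00001976' (day/month 00) and strings that do not
        # match a full day-month-year layout at all.
        return None


print(dmy_strict("30111976"))  # 1976-11-30
print(dmy_strict("00001976"))  # None
```

The same guard could equally be expressed as a regex added to the bindings' format-matching step.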

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" 
> as date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`:
> an invalid date such as '00001976' is being parsed as a valid (and 
> completely unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
> x
> 1: 1975-11-30
> 2: 1976-11-30
> 3: 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  
>  





[jira] [Assigned] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-18246:
--

Assignee: Will Jones

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is also rendered on the 
> documentation website.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  





[jira] [Closed] (ARROW-18245) wheels for PyArrow + Python 3.11

2022-11-04 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones closed ARROW-18245.
--
Resolution: Duplicate

Hello! This is being actively worked on in ARROW-17487. I've closed this ticket 
since it duplicates that one.

> wheels for PyArrow + Python 3.11
> 
>
> Key: ARROW-18245
> URL: https://issues.apache.org/jira/browse/ARROW-18245
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 10.0.0
> Environment: Linux RH8
>Reporter: Aleksandar
>Priority: Minor
>
> Hi,
> May we know the plan for when the PyPI pyarrow 10 package will have wheels 
> built as part of the release? Right now the pyarrow 10 package has no 
> wheels for py3.11.0.
> Maybe this is not the right forum, but someone is maintaining and packaging 
> these things for developers.
> Thanks much, and sorry for intruding ...





[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation

2022-11-04 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629016#comment-17629016
 ] 

Will Jones commented on ARROW-18228:


If you are still getting errors, it might be worth reviewing how you could 
slow your app down somewhat so as not to hit these errors.

[https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/]

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html]

[https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html]

I'm not sure if we have any other settings to limit concurrent requests or tune 
the backoff strategy, but that might be helpful for cases like this.
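If throttling the writer is the way to go, one generic pattern is retry with exponential backoff and jitter around the write call. The sketch below is an assumption about how a caller could do this themselves; the `put_with_backoff` helper and its parameters are invented for illustration and are not a pyarrow setting:

```python
import random
import time


def put_with_backoff(op, max_tries=5, base=0.5):
    """Call `op` (e.g. a closure around ds.write_dataset), retrying with
    exponential backoff plus jitter when S3 replies SLOW_DOWN.
    Illustrative sketch only, not an actual pyarrow API."""
    for attempt in range(max_tries):
        try:
            return op()
        except OSError as exc:
            # Re-raise anything that is not a throttling error,
            # or when we are out of retries.
            if "SLOW_DOWN" not in str(exc) or attempt == max_tries - 1:
                raise
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))
```

A caller would wrap the failing call, e.g. `put_with_backoff(lambda: ds.write_dataset(table, "bucket/path", filesystem=fs))`.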

> AWS Error SLOW_DOWN during PutObject operation
> --
>
> Key: ARROW-18228
> URL: https://issues.apache.org/jira/browse/ARROW-18228
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 10.0.0
>Reporter: Vadym Dytyniak
>Priority: Major
>
> We use Dask to parallelise read/write operations and pyarrow to write dataset 
> from worker nodes.
> After pyarrow released version 10.0.0, our data flows automatically switched 
> to the latest version and some of them started to fail with the following 
> error:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 
> 768, in _write_partition
> ds.write_dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 988, in write_dataset
> _filesystemdataset_write(
>   File "pyarrow/_dataset.pyx", line 2859, in 
> pyarrow._dataset._filesystemdataset_write
> check_status(CFileSystemDataset.Write(c_options, c_scanner))
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When creating key 'equities.us.level2.by_security/' in bucket 
> 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce 
> your request rate. {code}
> In total flow failed many times: most failed with the error above, but one 
> failed with:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line 
> 857, in _load_partition
> table = ds.dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 752, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 444, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 411, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When getting information for key 
> 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet'
>  in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject 
> operation: curlCode: 28, Timeout was reached {code}
>  
> Do you have any idea what changed in dataset writing between 9.0.0 and 
> 10.0.0, to help us fix the issue?





[jira] [Updated] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18242:
-
Priority: Critical  (was: Major)

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" 
> as date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`:
> an invalid date such as '00001976' is being parsed as a valid (and 
> completely unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
> x
> 1: 1975-11-30
> 2: 1976-11-30
> 3: 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  
>  





[jira] [Commented] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install

2022-11-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629013#comment-17629013
 ] 

Joris Van den Bossche commented on ARROW-18251:
---

Not directly. This build has been failing for some time (with a Cython test 
failure), but it seems to have only recently started failing with this 
installation issue.

From commits on master, 4 days ago there was a test failure: 
https://github.com/apache/arrow/actions/runs/3363608185/jobs/5576970377
3 days ago there was an installation failure: 
https://github.com/apache/arrow/actions/runs/3372998317/jobs/5597074003

A relevant difference is a pip 22.2.2 -> 22.3 update.
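A plausible explanation for the `error: invalid command 'bdist_wheel'` failure (an assumption, not confirmed in this thread) is that with `--no-build-isolation` the build uses the ambient environment, and setuptools only gains a `bdist_wheel` command when the third-party `wheel` package is importable there; the pip 22.2.2 -> 22.3 update may have changed when that code path is exercised. A quick diagnostic of the build environment:

```python
import importlib.util


def build_env_has_wheel():
    """Return True when the 'wheel' package, which registers the
    bdist_wheel command with setuptools, is importable in this
    environment. Diagnostic sketch only."""
    return importlib.util.find_spec("wheel") is not None


print(build_env_has_wheel())
```

If this prints False in the CI image, installing `wheel` before the `pip install` step would be one thing to try.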

> [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install
> ---
>
> Key: ARROW-18251
> URL: https://issues.apache.org/jira/browse/ARROW-18251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Priority: Critical
> Fix For: 11.0.0
>
>
> Currently the job for AMD64 macOS 11 Python 3 is failing:
> [https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309]
> with:
> {code:java}
>  + python3 -m pip install --no-deps --no-build-isolation -vv .
> ~/work/arrow/arrow/python ~/work/arrow/arrow
> Using pip 22.3 from 
> /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip
>  (python 3.11)
> Non-user install because site-packages writeable
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Initialized build tracking at 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Created build tracker: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Entered build tracker: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e
> Processing /Users/runner/work/arrow/arrow/python
>   Added file:///Users/runner/work/arrow/arrow/python to build tracker 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw'
>   Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16
>   Preparing metadata (pyproject.toml): started
>   Running command Preparing metadata (pyproject.toml)
>   running dist_info
>   creating 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info
>   writing 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO
>   writing dependency_links to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt
>   writing entry points to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt
>   writing requirements to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt
>   writing top-level names to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt
>   writing manifest file 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
>   reading manifest template 'MANIFEST.in'
>   warning: no previously-included files matching '*.so' found anywhere in 
> distribution
>   warning: no previously-included files matching '*.pyc' found anywhere in 
> distribution
>   warning: no previously-included files matching '*~' found anywhere in 
> distribution
>   warning: no previously-included files matching '#*' found anywhere in 
> distribution
>   warning: no previously-included files matching '.DS_Store' found anywhere 
> in distribution
>   no previously-included directories found matching '.asv'
>   adding license file '../LICENSE.txt'
>   adding license file '../NOTICE.txt'
>   writing manifest file 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
>   creating 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info'
>   error: invalid command 'bdist_wheel'
>   error: subprocess-exited-with-error
>   
>   × Preparing metadata (pyproject.toml) did not run successfully.
>   │ exit code: 1
>   ╰─> See above for output.
>   
>   note: This error originates from a subprocess, and is 

[jira] [Commented] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install

2022-11-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629008#comment-17629008
 ] 

Raúl Cumplido commented on ARROW-18251:
---

[~jorisvandenbossche] [~alenka] Any idea what might be the cause of these 
failures? I will try to investigate.

> [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install
> ---
>
> Key: ARROW-18251
> URL: https://issues.apache.org/jira/browse/ARROW-18251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Priority: Critical
> Fix For: 11.0.0
>
>
> Currently the job for AMD64 macOS 11 Python 3 is failing:
> [https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309]
> with:
> {code:java}
>  + python3 -m pip install --no-deps --no-build-isolation -vv .
> ~/work/arrow/arrow/python ~/work/arrow/arrow
> Using pip 22.3 from 
> /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip
>  (python 3.11)
> Non-user install because site-packages writeable
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Initialized build tracking at 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Created build tracker: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Entered build tracker: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e
> Processing /Users/runner/work/arrow/arrow/python
>   Added file:///Users/runner/work/arrow/arrow/python to build tracker 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw'
>   Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16
>   Preparing metadata (pyproject.toml): started
>   Running command Preparing metadata (pyproject.toml)
>   running dist_info
>   creating 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info
>   writing 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO
>   writing dependency_links to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt
>   writing entry points to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt
>   writing requirements to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt
>   writing top-level names to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt
>   writing manifest file 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
>   reading manifest template 'MANIFEST.in'
>   warning: no previously-included files matching '*.so' found anywhere in 
> distribution
>   warning: no previously-included files matching '*.pyc' found anywhere in 
> distribution
>   warning: no previously-included files matching '*~' found anywhere in 
> distribution
>   warning: no previously-included files matching '#*' found anywhere in 
> distribution
>   warning: no previously-included files matching '.DS_Store' found anywhere 
> in distribution
>   no previously-included directories found matching '.asv'
>   adding license file '../LICENSE.txt'
>   adding license file '../NOTICE.txt'
>   writing manifest file 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
>   creating 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info'
>   error: invalid command 'bdist_wheel'
>   error: subprocess-exited-with-error
>   
>   × Preparing metadata (pyproject.toml) did not run successfully.
>   │ exit code: 1
>   ╰─> See above for output.
>   
>   note: This error originates from a subprocess, and is likely not a problem 
> with pip.
>   full command: /usr/local/bin/python3 
> /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py
>  prepare_metadata_for_build_wheel 
> /var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/tmpteb6e2qe
>   cwd: /Users/runner/work/arrow/arrow/python
>   

[jira] [Created] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install

2022-11-04 Thread Jira
Raúl Cumplido created ARROW-18251:
-

 Summary: [CI][Python] AMD64 macOS 11 Python 3 job fails on master 
on pip install
 Key: ARROW-18251
 URL: https://issues.apache.org/jira/browse/ARROW-18251
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Raúl Cumplido
 Fix For: 11.0.0


Currently the job for AMD64 macOS 11 Python 3 is failing:
[https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309]

with:
{code:java}
 + python3 -m pip install --no-deps --no-build-isolation -vv .
~/work/arrow/arrow/python ~/work/arrow/arrow
Using pip 22.3 from 
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip
 (python 3.11)
Non-user install because site-packages writeable
Created temporary directory: 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
Initialized build tracking at 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
Created build tracker: 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
Entered build tracker: 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
Created temporary directory: 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5
Created temporary directory: 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e
Processing /Users/runner/work/arrow/arrow/python
  Added file:///Users/runner/work/arrow/arrow/python to build tracker 
'/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw'
  Created temporary directory: 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16
  Preparing metadata (pyproject.toml): started
  Running command Preparing metadata (pyproject.toml)
  running dist_info
  creating 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info
  writing 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO
  writing dependency_links to 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt
  writing entry points to 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt
  writing requirements to 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt
  writing top-level names to 
/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt
  writing manifest file 
'/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  warning: no previously-included files matching '*.so' found anywhere in 
distribution
  warning: no previously-included files matching '*.pyc' found anywhere in 
distribution
  warning: no previously-included files matching '*~' found anywhere in 
distribution
  warning: no previously-included files matching '#*' found anywhere in 
distribution
  warning: no previously-included files matching '.DS_Store' found anywhere in 
distribution
  no previously-included directories found matching '.asv'
  adding license file '../LICENSE.txt'
  adding license file '../NOTICE.txt'
  writing manifest file 
'/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
  creating 
'/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info'
  error: invalid command 'bdist_wheel'
  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem 
with pip.
  full command: /usr/local/bin/python3 
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py
 prepare_metadata_for_build_wheel 
/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/tmpteb6e2qe
  cwd: /Users/runner/work/arrow/arrow/python
  Preparing metadata (pyproject.toml): finished with status 'error'
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Exception information:
Traceback (most recent call last):
  File 
"/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip/_internal/operations/build/metadata.py",
 line 35, in generate_metadata
    distinfo_dir = 

[jira] [Assigned] (ARROW-17487) [Python][Packaging] 3.11 wheels

2022-11-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-17487:
-

Assignee: Raúl Cumplido

> [Python][Packaging] 3.11 wheels
> ---
>
> Key: ARROW-17487
> URL: https://issues.apache.org/jira/browse/ARROW-17487
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Packaging, Python
>Reporter: Saugat Pachhai
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> Hi.
> Could you please provide wheels with Python 3.11 support?
> Python 3.11.0-rc1 was released 
> (https://www.python.org/downloads/release/python-3110rc1/), and from this 
> release onward, there won't be any ABI changes.
> > There will be no ABI changes from this point forward in the 3.11 series and 
> > the goal is that there will be as few code changes as possible.
> So, I think it should be safe to release wheels.
> Thank you. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628993#comment-17628993
 ] 

Nicola Crane commented on ARROW-18242:
--

Aha, this could be related to Windows as we've had some issues in the past 
there. I'll keep digging.

> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as 
> date
> -
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Priority: Major
>
> Sorry for so many issues, but I think this is another bug.
> Wrong behavior of the arrow implementation of `lubridate::dmy`.
> An invalid date such as '00001976' is being parsed as a valid (and completely 
> unrelated) date.
> #in R
> '00001976' %>% dmy
> [1] NA
> Warning message:
>   All formats failed to parse. No formats found. 
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy) %>% collect
> q2
> x
> 1: 1975-11-30
> 2: 1976-11-30
> 3: 1976-01-01
> #notice '00001976' is an invalid date. First row of x2 should be NA!!!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628973#comment-17628973
 ] 

Lucas Mation commented on ARROW-18242:
--

Weird. I can replicate the error both on arrow-10.0.0 (on the server) and the 
arrow nightly dev build (on my PC). Both are Windows machines.

``` 

#For the server:

arrow::arrow_info()
Arrow package version: 10.0.0

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                  
Allocator mimalloc
Current    1.12 Kb
Max       25.77 Kb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                                             
C++ Library Version                                    10.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

``` 

On my PC

``` 

#For the PC:

arrow_info()
Arrow package version: 10.0.0.10050

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                   
Allocator  mimalloc
Current   128 bytes
Max        25.52 Kb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                                                             
C++ Library Version                           11.0.0-SNAPSHOT
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               5e53978b56aa13f9c033f2e849cc22f2aed6e2d3

 

``` 

 

 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628963#comment-17628963
 ] 

Nicola Crane edited comment on ARROW-18242 at 11/4/22 1:07 PM:
---

I can't replicate this with 10.0.0 either.  FWIW I re-organised the code a bit 
here as the {{mutate()}} call there gave me an error.


{code:r}
library(dplyr)
library(data.table)
library(lubridate)
library(arrow)

'00001976' %>% dmy()
#> Warning: All formats failed to parse. No formats found.
#> [1] NA

#In arrow
q <- data.table(x=c('00001976','30111976','01011976'))
q %>% write_dataset('q')

q2 <- open_dataset('q') %>% mutate(x2=dmy(x)) %>% collect()
q2
#>           x         x2
#> 1: 00001976           
#> 2: 30111976 1976-11-30
#> 3: 01011976 1976-01-01
{code}

[~lucasmation] If you run the code in the way I've rewritten it above, do you 
get anything different?  Also, can you confirm which version of Arrow you are 
using? You can use {{arrow::arrow_info()}} to find it if you're not sure.


was (Author: thisisnic):
I can't replicate this with 10.0.0 either.  FWIW I re-organised the code a bit 
here as the {{mutate()}} call there gave me an error.


{code:r}
library(dplyr)
library(data.table)
library(lubridate)
library(arrow)

'00001976' %>% dmy()

#In arrow
q <- data.table(x=c('00001976','30111976','01011976'))
q %>% write_dataset('q')

q2 <- open_dataset('q') %>% mutate(x2=dmy(x)) %>% collect()
q2
{code}

[~lucasmation] If you run the code in the way I've rewritten it above, do you 
get anything different?  Also, can you confirm which version of Arrow you are 
using? You can use {{arrow::arrow_info()}} to find it if you're not sure.
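For readers following along without R, the behaviour the reporter expects from {{dmy}} can be sketched with Python's standard-library strict parser (an illustration only, not the lubridate or Arrow implementation):

```python
from datetime import datetime

def dmy(s):
    # Strict ddmmyyyy parsing: a day/month of 00 is rejected,
    # mirroring lubridate::dmy returning NA for '00001976'.
    try:
        return datetime.strptime(s, "%d%m%Y").date()
    except ValueError:
        return None

for s in ("00001976", "30111976", "01011976"):
    print(s, "->", dmy(s))
# 00001976 -> None
# 30111976 -> 1976-11-30
# 01011976 -> 1976-01-01
```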




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628963#comment-17628963
 ] 

Nicola Crane commented on ARROW-18242:
--

I can't replicate this with 10.0.0 either.  FWIW I re-organised the code a bit 
here as the {{mutate()}} call there gave me an error.


{code:r}
library(dplyr)
library(data.table)
library(lubridate)
library(arrow)

'00001976' %>% dmy()

#In arrow
q <- data.table(x=c('00001976','30111976','01011976'))
q %>% write_dataset('q')

q2 <- open_dataset('q') %>% mutate(x2=dmy(x)) %>% collect()
q2
{code}

[~lucasmation] If you run the code in the way I've rewritten it above, do you 
get anything different?  Also, can you confirm which version of Arrow you are 
using? You can use {{arrow::arrow_info()}} to find it if you're not sure.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18242) [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as date

2022-11-04 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628961#comment-17628961
 ] 

Nicola Crane commented on ARROW-18242:
--

Interesting, I tried to replicate this with 9.0.0 but could not. Will try with 
10.0.0 shortly.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation

2022-11-04 Thread Vadym Dytyniak (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628957#comment-17628957
 ] 

Vadym Dytyniak edited comment on ARROW-18228 at 11/4/22 12:54 PM:
--

[~willjones127] It helped. Thanks. Do you recommend using this strategy, or 
does it mean that we exceeded the rate limit and should review our 
implementation?


was (Author: JIRAUSER297843):
[~willjones127] It helped. Do you recommend using this strategy, or does it 
mean that we exceeded the rate limit and should review our implementation?
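The retry strategy referenced here is not quoted in the thread; as a generic illustration (names and parameters hypothetical, not pyarrow API), an exponential-backoff wrapper around a flaky write looks like:

```python
import time

def with_retries(fn, max_attempts=5, base_delay=0.1):
    # Retry fn() on OSError with exponential backoff,
    # re-raising once max_attempts is exhausted.
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

For example, `with_retries(lambda: ds.write_dataset(...))` around the failing call; S3's SLOW_DOWN responses are explicitly a request to back off, so client-side throttling may be the better long-term fix.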

> AWS Error SLOW_DOWN during PutObject operation
> --
>
> Key: ARROW-18228
> URL: https://issues.apache.org/jira/browse/ARROW-18228
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 10.0.0
>Reporter: Vadym Dytyniak
>Priority: Major
>
> We use Dask to parallelise read/write operations and pyarrow to write dataset 
> from worker nodes.
> After pyarrow released version 10.0.0, our data flows automatically switched 
> to the latest version and some of them started to fail with the following 
> error:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/org/store/storage.py", line 
> 768, in _write_partition
> ds.write_dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 988, in write_dataset
> _filesystemdataset_write(
>   File "pyarrow/_dataset.pyx", line 2859, in 
> pyarrow._dataset._filesystemdataset_write
> check_status(CFileSystemDataset.Write(c_options, c_scanner))
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When creating key 'equities.us.level2.by_security/' in bucket 
> 'org-prod': AWS Error SLOW_DOWN during PutObject operation: Please reduce 
> your request rate. {code}
> In total flow failed many times: most failed with the error above, but one 
> failed with:
> {code:java}
> File "/usr/local/lib/python3.10/dist-packages/chronos/store/storage.py", line 
> 857, in _load_partition
> table = ds.dataset(
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 752, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 444, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 
> 411, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 564, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: When getting information for key 
> 'ns/date=2022-10-31/channel=4/feed=A/9f41f928eedc431ca695a7ffe5fc60c2-0.parquet'
>  in bucket 'org-poc': AWS Error NETWORK_CONNECTION during HeadObject 
> operation: curlCode: 28, Timeout was reached {code}
>  
> Do you have any idea what changed in dataset writing between 9.0.0 and 
> 10.0.0, to help us fix the issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18228) AWS Error SLOW_DOWN during PutObject operation

2022-11-04 Thread Vadym Dytyniak (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628957#comment-17628957
 ] 

Vadym Dytyniak commented on ARROW-18228:


[~willjones127] It helped. Do you recommend using this strategy, or does it 
mean that we exceeded the rate limit and should review our implementation?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628955#comment-17628955
 ] 

Lucas Mation commented on ARROW-18244:
--

Sure, created ticket ARROW-18250.

> [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel 
> matching input types 
> -
>
> Key: ARROW-18244
> URL: https://issues.apache.org/jira/browse/ARROW-18244
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Lucas Mation
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 11.0.0
>
>
>  
> ```
> q <- data.table(x=c('','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=ifelse(x=='',NA,x)) %>% collect
> Error in `collect()`:
> ! NotImplemented: Function 'if_else' has no kernel matching input types 
> (bool, bool, string)
> Run `rlang::last_error()` to see where the error occurred.
> ```
> In [Functions available in Arrow dplyr 
> queries|https://arrow.apache.org/docs/r/reference/acero.html] it is stated 
> that ifelse() is available, but the error above suggests otherwise.
> Update: "str_replace" is another function that is supposedly available in 
> 10.0.0 but is not (or does not seem to be):
> ```
> q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% str_replace('',NA)) %>% collect
> Error: Expression x %>% str_replace("", NA) not supported in Arrow
> Call collect() first to pull data into R.
> ```
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18250) [R] mutate(x2=x %>% str_replace('^\\s*$',NA_character_)) Does not replicate behaviour of R

2022-11-04 Thread Lucas Mation (Jira)
Lucas Mation created ARROW-18250:


 Summary: [R]  mutate(x2=x %>% str_replace('^\\s*$',NA_character_)) 
Does not replicate behaviour of R
 Key: ARROW-18250
 URL: https://issues.apache.org/jira/browse/ARROW-18250
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lucas Mation


```

q <- data.table(x=c('','1','2'))
q %>% write_dataset('q')

#in R

q %>% mutate(x2=x %>% str_replace('^\\s*$',NA_character_))

   x   x2
1:   
2: 1    1
3: 2    2

#in arrow

q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% 
str_replace('^\\s*$',NA_character_)) %>% collect

q2

   x x2
1:     
2: 1  1
3: 2  2

```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types

2022-11-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628945#comment-17628945
 ] 

Neal Richardson commented on ARROW-18244:
-

I'm not sure that's a valid use of string replacement since that's replacing 
substrings into the elements of the character vector, not necessarily replacing 
the entire element (ifelse is what you want for that), but we can detect and 
reject that case. Could you please open a new issue for that? Thanks!
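Neal's distinction between substituting inside each string and replacing the element as a whole can be illustrated outside R (a Python sketch, not Arrow's implementation):

```python
import re

values = ['', '1', '2']

# str_replace-style: substitute a pattern *within* each string.
subst = [re.sub(r'^\s*$', 'MISSING', v) for v in values]

# ifelse-style: replace the *entire element* when it is blank.
whole = [None if re.fullmatch(r'\s*', v) else v for v in values]

print(subst)  # ['MISSING', '1', '2']
print(whole)  # [None, '1', '2']
```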




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628919#comment-17628919
 ] 

Lucas Mation commented on ARROW-18244:
--

[~npr] OK. Then, I guess we should close this issue, since ifelse is fixed in 
the dev version. 

But on str_replace (and maybe this should be a ticket of its own), it does not 
replicate R's behaviour in this case:

```

q <- data.table(x=c('','1','2'))
q %>% write_dataset('q')

#in R

q %>% mutate(x2=x %>% str_replace('^\\s*$',NA_character_))

   x   x2
1:   
2: 1    1
3: 2    2

#in arrow

q2 <- 'q' %>% open_dataset %>% mutate(x2=x %>% 
str_replace('^\\s*$',NA_character_)) %>% collect

q2

   x x2
1:     
2: 1  1
3: 2  2

```




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18244) [R] `mutate(x2=ifelse(x=='',NA,x))` Error: Function 'if_else' has no kernel matching input types

2022-11-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-18244.
-
Fix Version/s: 11.0.0
 Assignee: Neal Richardson
   Resolution: Fixed




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-13938) [C++] Date and datetime types should autocast from strings

2022-11-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13938.
-
Fix Version/s: 11.0.0
 Assignee: Neal Richardson
   Resolution: Fixed

> [C++] Date and datetime types should autocast from strings
> --
>
> Key: ARROW-13938
> URL: https://issues.apache.org/jira/browse/ARROW-13938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>  Labels: kernel
> Fix For: 11.0.0
>
>
> When comparing dates and datetimes, people frequently expect that a string 
> (formatted as ISO8601) will auto-cast and compare to dates and times.
> Examples in R:
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> arr <- Array$create(as.Date(c("1974-04-06", "1988-05-09")))
> arr > "1980-01-01"
> #> Error: NotImplemented: Function greater has no kernel matching input types 
> (array[date32[day]], scalar[string])
> # creating the scalar as a date works, of course
> arr > Scalar$create(as.Date("1980-01-01"))
> #> Array
> #> 
> #> [
> #>   false,
> #>   true
> #> ]
> # datatimes also do not auto-cast
> arr <- Array$create(Sys.time())
> arr > "1980-01-01 00:00:00"
> #> Error: NotImplemented: Function greater has no kernel matching input types 
> (array[timestamp[us]], scalar[string])
> # or a more real-world example
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> mtcars$date <- as.Date(c("1974-04-06", "1988-05-09"))
> ds <- InMemoryDataset$create(mtcars)
> ds %>%
>   filter(date > "1980-01-01") %>%
>   collect()
> #> Error: NotImplemented: Function greater has no kernel matching input types 
> (array[date32[day]], scalar[string])
> {code}
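String autocasting is reasonable here because ISO 8601 date strings sort lexicographically in chronological order, so comparing a date against such a string is well defined. A quick Python illustration (not the Arrow kernel itself):

```python
from datetime import date

dates = [date(1974, 4, 6), date(1988, 5, 9)]
cutoff = "1980-01-01"

# ISO 8601 strings order lexicographically the same way they order
# chronologically, so comparison can happen on the string form.
print([d.isoformat() > cutoff for d in dates])  # [False, True]
```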



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18241) [C++] Add cast option to return null for values that can't convert

2022-11-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-18241:

Component/s: C++

> [C++] Add cast option to return null for values that can't convert
> --
>
> Key: ARROW-18241
> URL: https://issues.apache.org/jira/browse/ARROW-18241
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Lucas Mation
>Priority: Major
>
> I am importing a dataset with arrow, and then converting variable types. But 
> I got an error message because the `arrow` implementation of `as.integer` 
> can't handle empty strings (which are legal in base R). Is this a bug?
> {code:r}
> #In R
> '' %>% as.integer()
> [1] NA
>  
> #in arrow
> q <- data.table(x=c('','1','2'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! Invalid: Failed to parse string: '' as a scalar of type int32
> Run `rlang::last_error()` to see where the error occurred.
> {code}
> Update: tried to preprocess x with `ifelse`, but it also did not work.
> {code:r}
> 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% 
> mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! NotImplemented: Function 'if_else' has no kernel matching input types 
> (bool, bool, string)
> Run `rlang::last_error()` to see where the error occurred.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18241) [C++] Add cast option to return null for values that can't convert

2022-11-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-18241:

Summary: [C++] Add cast option to return null for values that can't convert 
 (was: [R] as.integer can't handdle empty character cels (ex c('')))




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18241) [R] as.integer can't handdle empty character cels (ex c(''))

2022-11-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628905#comment-17628905
 ] 

Neal Richardson commented on ARROW-18241:
-

> "I agree this would be a nice option to have." > Sure. But it should be the 
> default behavior, as that is what happens in base R, no?

Sorry, that was ambiguous. We would need the C++ cast function to support an 
option to return nulls for values that can't be converted, rather than just 
erroring. If/when that option exists, then yes, we would make that the default in R.

I'll rename this ticket to be about that feature.

> [R] as.integer can't handle empty character cells (ex c(''))
> 
>
> Key: ARROW-18241
> URL: https://issues.apache.org/jira/browse/ARROW-18241
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Major
>
> I am importing a dataset with arrow, and then converting variable types. But 
> I got an error message because the `arrow` implementation of `as.integer` 
> can't handle empty strings (which is legal in base R). Is this a bug?
> {code:r}
> #In R
> '' %>% as.integer()
> [1] NA
>  
> #in arrow
> q <- data.table(x=c('','1','2'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! Invalid: Failed to parse string: '' as a scalar of type int32
> Run `rlang::last_error()` to see where the error occurred.
> {code}
> Update: tried to preprocess x with `ifelse`, but it also did not work.
> {code:r}
> 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% 
> mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! NotImplemented: Function 'if_else' has no kernel matching input types 
> (bool, bool, string)
> Run `rlang::last_error()` to see where the error occurred.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18241) [R] as.integer can't handle empty character cells (ex c(''))

2022-11-04 Thread Lucas Mation (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628902#comment-17628902
 ] 

Lucas Mation commented on ARROW-18241:
--

[~npr], thanks. 

1)

"I agree this would be a nice option to have." > Sure. But it should be the 
default behavior, as that is what happens in base R, no?

2) 

"it should work as you typed it on the development version" > tested, works. 
Thanks.

"On the released version, you can make it work by explicitly making the NA 
be a string so the types match" > tested, works.

 

> [R] as.integer can't handle empty character cells (ex c(''))
> 
>
> Key: ARROW-18241
> URL: https://issues.apache.org/jira/browse/ARROW-18241
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Major
>
> I am importing a dataset with arrow, and then converting variable types. But 
> I got an error message because the `arrow` implementation of `as.integer` 
> can't handle empty strings (which is legal in base R). Is this a bug?
> {code:r}
> #In R
> '' %>% as.integer()
> [1] NA
>  
> #in arrow
> q <- data.table(x=c('','1','2'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! Invalid: Failed to parse string: '' as a scalar of type int32
> Run `rlang::last_error()` to see where the error occurred.
> {code}
> Update: tried to preprocess x with `ifelse`, but it also did not work.
> {code:r}
> 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% 
> mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! NotImplemented: Function 'if_else' has no kernel matching input types 
> (bool, bool, string)
> Run `rlang::last_error()` to see where the error occurred.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18241) [R] as.integer can't handle empty character cells (ex c(''))

2022-11-04 Thread Lucas Mation (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Mation updated ARROW-18241:
-
Description: 
I am importing a dataset with arrow, and then converting variable types. But I 
got an error message because the `arrow` implementation of `as.integer` can't 
handle empty strings (which is legal in base R). Is this a bug?
{code:r}
#In R
'' %>% as.integer()

[1] NA

 

#in arrow

q <- data.table(x=c('','1','2'))
q %>% write_dataset('q')
q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect

Error in `collect()`:
! Invalid: Failed to parse string: '' as a scalar of type int32
Run `rlang::last_error()` to see where the error occurred.
{code}
Update: tried to preprocess x with `ifelse`, but it also did not work.
{code:r}
'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% 
mutate(x=as.integer(x)) %>% collect
Error in `collect()`:
! NotImplemented: Function 'if_else' has no kernel matching input types (bool, 
bool, string)
Run `rlang::last_error()` to see where the error occurred.
{code}

  was:
I am importing a dataset with arrow, and then converting variable types. But I 
got an error message because the `arrow` implementation of `as.integer` can't 
handle empty strings (which is legal in base R). Is this a bug?
{code:r}
#In R
'' %>% as.integer()

[1] NA

 

#in arrow

q <- data.table(x=c('','1','2'))
q %>% write_dataset('q')
q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect

Error in `collect()`:
! Invalid: Failed to parse string: '' as a scalar of type int32
Run `rlang::last_error()` to see where the error occurred.
{code}
Update: tried to preprocess x with `ifelse`, but it also did not work.
{code:r}
paste0(p2,'/q') %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% 
mutate(x=as.integer(x)) %>% collect
Error in `collect()`:
! NotImplemented: Function 'if_else' has no kernel matching input types (bool, 
bool, string)
Run `rlang::last_error()` to see where the error occurred.
{code}


> [R] as.integer can't handle empty character cells (ex c(''))
> 
>
> Key: ARROW-18241
> URL: https://issues.apache.org/jira/browse/ARROW-18241
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Lucas Mation
>Priority: Major
>
> I am importing a dataset with arrow, and then converting variable types. But 
> I got an error message because the `arrow` implementation of `as.integer` 
> can't handle empty strings (which is legal in base R). Is this a bug?
> {code:r}
> #In R
> '' %>% as.integer()
> [1] NA
>  
> #in arrow
> q <- data.table(x=c('','1','2'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! Invalid: Failed to parse string: '' as a scalar of type int32
> Run `rlang::last_error()` to see where the error occurred.
> {code}
> Update: tried to preprocess x with `ifelse`, but it also did not work.
> {code:r}
> 'q' %>% open_dataset %>% mutate(x= ifelse(x=='',NA,x)) %>% 
> mutate(x=as.integer(x)) %>% collect
> Error in `collect()`:
> ! NotImplemented: Function 'if_else' has no kernel matching input types 
> (bool, bool, string)
> Run `rlang::last_error()` to see where the error occurred.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18249) Update vcpkg port to arrow 10.0.0

2022-11-04 Thread Bernhard Manfred Gruber (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernhard Manfred Gruber updated ARROW-18249:

Summary: Update vcpkg port to arrow 10.0.0  (was: Update vcpkg port to 
arrow 10.0)

> Update vcpkg port to arrow 10.0.0
> -
>
> Key: ARROW-18249
> URL: https://issues.apache.org/jira/browse/ARROW-18249
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 10.0.0
>Reporter: Bernhard Manfred Gruber
>Priority: Minor
>
> Please update the [vcpkg|https://github.com/microsoft/vcpkg] port of arrow to 
> the newly released version 10.0.0. The current version on vcpkg is 9.0.0.
> I found this documentation on how to do it: 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updatingthevcpkgport
> I need this downstream to update the xsimd port on vcpkg, which I need 
> downstream in another project. Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18249) Update vcpkg port to arrow 10.0

2022-11-04 Thread Bernhard Manfred Gruber (Jira)
Bernhard Manfred Gruber created ARROW-18249:
---

 Summary: Update vcpkg port to arrow 10.0
 Key: ARROW-18249
 URL: https://issues.apache.org/jira/browse/ARROW-18249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 10.0.0
Reporter: Bernhard Manfred Gruber


Please update the [vcpkg|https://github.com/microsoft/vcpkg] port of arrow to 
the newly released version 10.0.0. The current version on vcpkg is 9.0.0.

I found this documentation on how to do it: 
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Updatingthevcpkgport

I need this downstream to update the xsimd port on vcpkg, which I need 
downstream in another project. Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-12739) [C++] Function to combine Arrays row-wise into ListArray

2022-11-04 Thread Jin Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Shang reassigned ARROW-12739:
-

Assignee: Jin Shang

> [C++] Function to combine Arrays row-wise into ListArray
> 
>
> Key: ARROW-12739
> URL: https://issues.apache.org/jira/browse/ARROW-12739
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Jin Shang
>Priority: Major
>  Labels: kernel, query-engine
>
> Add a variadic function that would take 2+ Arrays and combine/transpose them 
> rowwise into a ListArray. For example:
>  Input:
> {code:java}
> Array<String>   Array<String>
> [               [
>   "foo",          "bar",
>   "push"          "pop"
> ]               ]
>  {code}
> Output:
> {code:java}
> ListArray<String>
> [
>   ["foo","bar"],
>   ["push","pop"]
> ]
> {code}
> This is similar to the StructArray constructor which takes a list of Arrays 
> and names (but in this case it would only need to take a list of Arrays).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18248) [CI][Release] Use GitHub token to avoid API rate limit

2022-11-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18248:
---
Labels: pull-request-available  (was: )

> [CI][Release] Use GitHub token to avoid API rate limit
> --
>
> Key: ARROW-18248
> URL: https://issues.apache.org/jira/browse/ARROW-18248
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> e.g.:
> https://github.com/apache/arrow/actions/runs/3387695588/jobs/5628769268#step:7:25
> {noformat}
> Error: test_vote(SourceTest): OpenURI::HTTPError: 403 rate limit exceeded
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18248) [CI][Release] Use GitHub token to avoid API rate limit

2022-11-04 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18248:


 Summary: [CI][Release] Use GitHub token to avoid API rate limit
 Key: ARROW-18248
 URL: https://issues.apache.org/jira/browse/ARROW-18248
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


e.g.:

https://github.com/apache/arrow/actions/runs/3387695588/jobs/5628769268#step:7:25

{noformat}
Error: test_vote(SourceTest): OpenURI::HTTPError: 403 rate limit exceeded
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18185) [C++][Compute] Support KEEP_NULL option for compute::Filter

2022-11-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628751#comment-17628751
 ] 

Joris Van den Bossche commented on ARROW-18185:
---

> What about implementing this as a specialized optimization for the if_else AAS 
> case,

That sounds good to me

> [C++][Compute] Support KEEP_NULL option for compute::Filter
> ---
>
> Key: ARROW-18185
> URL: https://issues.apache.org/jira/browse/ARROW-18185
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The current Filter implementation always drops the filtered values. In some 
> use cases, it's desirable for the output array to have the same size as the 
> input array. So I added a new option FilterOptions::KEEP_NULL where the 
> filtered values are kept as nulls.
> For example, with input [1, 2, 3] and filter [true, false, true], the current 
> implementation will output [1, 3] and with the new option it will output [1, 
> null, 3]
> This option is simpler to implement since we only need to construct a new 
> validity bitmap and reuse the input buffers and child arrays, except for 
> dense union arrays, which don't have validity bitmaps.
> It is also faster to filter with FilterOptions::KEEP_NULL according to the 
> benchmark result in most cases. So users can choose this option for better 
> performance when dropping filtered values is not required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12739) [C++] Function to combine Arrays row-wise into ListArray

2022-11-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-12739:
--
Labels: kernel query-engine  (was: )

> [C++] Function to combine Arrays row-wise into ListArray
> 
>
> Key: ARROW-12739
> URL: https://issues.apache.org/jira/browse/ARROW-12739
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>  Labels: kernel, query-engine
>
> Add a variadic function that would take 2+ Arrays and combine/transpose them 
> rowwise into a ListArray. For example:
>  Input:
> {code:java}
> Array<String>   Array<String>
> [               [
>   "foo",          "bar",
>   "push"          "pop"
> ]               ]
>  {code}
> Output:
> {code:java}
> ListArray<String>
> [
>   ["foo","bar"],
>   ["push","pop"]
> ]
> {code}
> This is similar to the StructArray constructor which takes a list of Arrays 
> and names (but in this case it would only need to take a list of Arrays).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)