[jira] [Reopened] (ARROW-18067) compile error in ARM64

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reopened ARROW-18067:
--

> compile error in ARM64
> --
>
> Key: ARROW-18067
> URL: https://issues.apache.org/jira/browse/ARROW-18067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
> Environment: Kunpeng 920 CentOS 7
>Reporter: chenqiang
>Assignee: chenqiang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-10-15-11-08-33-142.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Compile error on ARM64: verification of the third-party Boost tarball's
> SHA256 failed. The Boost version is boost_1_75_0.tar.gz, downloaded from:
> [https://sourceforge.net/projects/boost/files/boost/1.75.0/]
> !image-2022-10-15-11-08-33-142.png!
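When a third-party checksum verification fails like this, a useful first step is to hash the downloaded tarball locally and compare it against the value the build expects. A minimal sketch (the helper name is ours, not part of Arrow's build system):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing this digest with the checksum pinned in Arrow's third-party dependency list shows whether the download was corrupted (e.g. a mirror served an HTML error page) or the pinned value itself is stale.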



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18082) Check for broken links on generated sites

2022-10-17 Thread Benson Muite (Jira)
Benson Muite created ARROW-18082:


 Summary: Check for broken links on generated sites
 Key: ARROW-18082
 URL: https://issues.apache.org/jira/browse/ARROW-18082
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Reporter: Benson Muite


Some of the generated sites have broken links. Many static site generation 
tools can check for these when building the site.  This should be enabled for 
the R package, and possibly for the entire documentation website. Some related 
pull requests:
https://github.com/apache/arrow/pull/14443
https://github.com/apache/arrow/pull/14437
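As a first step toward such a check, the link targets on a rendered page can be collected with the standard library alone; a minimal sketch (the class name is ours), with the actual liveness checking of each URL left to a separate step:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from anchor tags in a rendered HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="https://arrow.apache.org/docs/r/">docs</a></p>')
```

In practice a site generator's built-in checker (or a dedicated tool run in CI) does this plus HTTP status checks, but the extraction step is the same.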





[jira] [Commented] (ARROW-17983) [Parquet][C++][Python] "List index overflow" when read parquet file

2022-10-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619234#comment-17619234
 ] 

Micah Kornfield commented on ARROW-17983:
-

IIRC, the offset type here is inferred from the schema (i.e. List vs 
LargeList) that we are trying to read back into. Once the offsets reach the 
int32 max we can't return, since the reading path doesn't support chunking at 
the moment.

Three options to fix this seem to be:
1. Infer that LargeList should be used based on RowGroup/File statistics.
2. Allow overriding the schema (this might already be an option) to take a 
LargeList override.
3. Modify the code to allow for chunked arrays (I seem to recall this would be 
a fair amount of work based on current assumptions, but it's been a while 
since I dug into the code).

I seem to recall someone tried prototyping option 2 recently, but I'm having 
trouble finding the thread/JIRA at the moment.

> [Parquet][C++][Python] "List index overflow" when read parquet file
> ---
>
> Key: ARROW-17983
> URL: https://issues.apache.org/jira/browse/ARROW-17983
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Reporter: Yibo Cai
>Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < 
> max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on reproducing this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Testing with a small dataset suggests the error comes from the code below:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is 
> incremented) {{m * n}} times which is beyond {{max(int32)}}.
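The arithmetic behind the overflow can be sketched directly; the concrete values of {{n}} and {{m}} below are illustrative only:

```python
INT32_MAX = 2**31 - 1

n = 100_000_000       # number of rows: fits comfortably in int32
m = 30                # list elements per row
final_offset = n * m  # last entry of the list offsets buffer

# The row count is fine, but the final offset no longer fits in int32,
# which is what trips the "List index overflow" error on read.
row_count_ok = n < INT32_MAX
offset_overflows = final_offset > INT32_MAX
```

This is why the file writes successfully (each row group is small enough) but the read path, which accumulates a single int32 offsets buffer, fails.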





[jira] [Resolved] (ARROW-17849) [R][Docs] Document changes due to C++17 for centos-7 users

2022-10-17 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-17849.
-
Resolution: Fixed

Issue resolved by pull request 14440
[https://github.com/apache/arrow/pull/14440]

> [R][Docs] Document changes due to C++17 for centos-7 users
> --
>
> Key: ARROW-17849
> URL: https://issues.apache.org/jira/browse/ARROW-17849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> With the switch to C++17, CentOS 7 users need to install and enable 
> devtoolset (and possibly change Makevars) to be able to compile the R 
> package, even when using the libarrow binary (see [ARROW-17594]). They can, 
> however, use INSTALL_opts = "--build" to get a binary package that is 
> installable on a CentOS machine WITHOUT devtoolset (this belongs in the 
> offline build section). The CentOS 7 RSPM is an alternative source for that 
> binary package. 
> Also add messaging in configure or build_arrow_static.sh so that if someone 
> is trying to install from source with gcc 4.8, we tell them what they need 
> to do.
> This should be documented and noted in the release notes.





[jira] [Assigned] (ARROW-15328) [C++] Streaming CSV reader missing from documentation

2022-10-17 Thread Bryce Mecum (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryce Mecum reassigned ARROW-15328:
---

Assignee: Bryce Mecum

> [C++] Streaming CSV reader missing from documentation
> -
>
> Key: ARROW-15328
> URL: https://issues.apache.org/jira/browse/ARROW-15328
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Bryce Mecum
>Priority: Major
>
> The streaming CSV reader is missing from the C++ documentation.  The table 
> reader is documented here: 
> https://arrow.apache.org/docs/cpp/api/formats.html?highlight=tablereader#_CPPv4N5arrow3csv11TableReaderE
> The references to streaming reader are missing from 
> {{docs/source/cpp/api/formats.rst}}.  In addition, we should probably add an 
> example of using the streaming CSV reader to this page: 
> https://arrow.apache.org/docs/cpp/csv.html
> We should probably also add a short paragraph describing the tradeoffs 
> between the two.





[jira] [Commented] (ARROW-17995) [C++] arrow::json::DecimalConverter should rescale values based on the explicit_schema

2022-10-17 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619147#comment-17619147
 ] 

Quanlong Huang commented on ARROW-17995:


[~apitrou] Can we backport this to older branches like 8.0.2 and 9.0.1? I 
think it's a clean cherry-pick.

> [C++] arrow::json::DecimalConverter should rescale values based on the 
> explicit_schema
> --
>
> Key: ARROW-17995
> URL: https://issues.apache.org/jira/browse/ARROW-17995
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.0, 6.0.1, 6.0.2, 7.0.0, 7.0.1, 8.0.0, 8.0.1, 9.0.0
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The C++ lib doesn't read JSON decimal values correctly based on the 
> explicit_schema. This can be reproduced by this helloworld program: 
> [https://github.com/stiga-huang/arrow-helloworld/tree/d267862]
> The input JSON file has the following rows:
> {code:json}
> {"id":1,"str":"Some","price":"30.04"}
> {"id":2,"str":"data","price":"1.234"} {code}
> If we read the price column using decimal128(9, 2), the values are
> {noformat}
>   30.04,
>   12.34
> {noformat}
> If we use decimal128(9, 3) instead, the values are
> {noformat}
>   3.004,
>   1.234
> {noformat}
> The decimal type in the explicit_schema is set here: 
> https://github.com/stiga-huang/arrow-helloworld/blob/d26786270e87d9ab847658ead96a96190461b98f/json_decimal_example.cc#L38
> The cause is {{arrow::json::DecimalConverter}} doesn't rescale the value 
> based on the out_type_:
> {code:cpp}
>   Status Convert(const std::shared_ptr<Array>& in,
>                  std::shared_ptr<Array>* out) override {
>     if (in->type_id() == Type::NA) {
>       return MakeArrayOfNull(out_type_, in->length(), pool_).Value(out);
>     }
>     const auto& dict_array = GetDictionaryArray(in);
>     using Builder = typename TypeTraits<DecimalSubtype>::BuilderType;
>     Builder builder(out_type_, pool_);
>     RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
>     auto visit_valid = [&builder](string_view repr) {
>       ARROW_ASSIGN_OR_RAISE(
>           value_type value,
>           TypeTraits<DecimalSubtype>::BuilderType::ValueType::FromString(repr));
>       // Should rescale the value based on out_type_ here
>       builder.UnsafeAppend(value);
>       return Status::OK();
>     };
>     auto visit_null = [&builder]() {
>       builder.UnsafeAppendNull();
>       return Status::OK();
>     };
>     RETURN_NOT_OK(VisitDictionaryEntries(dict_array, visit_valid,
>                                          visit_null));
>     return builder.Finish(out);
>   }
> {code}
> https://github.com/apache/arrow/blob/cdd0fdf39033b9cf132a5cfc9caa5ed60713845a/cpp/src/arrow/json/converter.cc#L171-L173
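The intended rescaling step can be sketched in plain Python with the decimal module; the helper below is illustrative only, not Arrow code:

```python
from decimal import Decimal

def to_scaled_integer(repr_str, target_scale):
    """Parse a decimal string and shift it to the target scale.

    This is the step the converter is missing: the unscaled integer
    appended to the builder must reflect the schema's scale, not the
    scale implied by the string representation.
    """
    return int(Decimal(repr_str).scaleb(target_scale))
```

With the target scale applied, "30.04" stores 3004 under decimal128(9, 2) and 30040 under decimal128(9, 3), instead of the same unscaled integer being reinterpreted at both scales.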





[jira] [Commented] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619123#comment-17619123
 ] 

Nicola Crane commented on ARROW-18079:
--

Great! Ping me again if this one does become the last blocker, and then, sure, 
we can revert ARROW-12105 if we need to, but I'm going to see what I can do to 
avoid needing that.

> [R] Performance regressions after ARROW-12105
> -
>
> Key: ARROW-18079
> URL: https://issues.apache.org/jira/browse/ARROW-18079
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Blocker
> Fix For: 10.0.0
>
>
> The functionality implemented in ARROW-12105 introduced some performance 
> regressions that we should sort out before the release.  





[jira] [Assigned] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-18079:


Assignee: Nicola Crane

> [R] Performance regressions after ARROW-12105
> -
>
> Key: ARROW-18079
> URL: https://issues.apache.org/jira/browse/ARROW-18079
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Blocker
> Fix For: 10.0.0
>
>
> The functionality implemented in ARROW-12105 introduced some performance 
> regressions that we should sort out before the release.  





[jira] [Commented] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619122#comment-17619122
 ] 

Kouhei Sutou commented on ARROW-18079:
--

OK, that's not too late, because we still have some blockers: 
https://cwiki.apache.org/confluence/display/ARROW/Arrow+10.0.0+Release

> [R] Performance regressions after ARROW-12105
> -
>
> Key: ARROW-18079
> URL: https://issues.apache.org/jira/browse/ARROW-18079
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Priority: Blocker
> Fix For: 10.0.0
>
>
> The functionality implemented in ARROW-12105 introduced some performance 
> regressions that we should sort out before the release.  





[jira] [Commented] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Nicola Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619121#comment-17619121
 ] 

Nicola Crane commented on ARROW-18079:
--

I was planning to take a look at it tomorrow, if that's not too late?

> [R] Performance regressions after ARROW-12105
> -
>
> Key: ARROW-18079
> URL: https://issues.apache.org/jira/browse/ARROW-18079
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Priority: Blocker
> Fix For: 10.0.0
>
>
> The functionality implemented in ARROW-12105 introduced some performance 
> regressions that we should sort out before the release.  





[jira] [Commented] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619116#comment-17619116
 ] 

Kouhei Sutou commented on ARROW-18079:
--

If this may take a long time to fix, can we revert ARROW-12105 for 10.0.0?

> [R] Performance regressions after ARROW-12105
> -
>
> Key: ARROW-18079
> URL: https://issues.apache.org/jira/browse/ARROW-18079
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Priority: Blocker
> Fix For: 10.0.0
>
>
> The functionality implemented in ARROW-12105 introduced some performance 
> regressions that we should sort out before the release.  





[jira] [Resolved] (ARROW-18078) [Docs][R] Broken link in R documentation

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18078.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14437
[https://github.com/apache/arrow/pull/14437]

> [Docs][R] Broken link in R documentation
> 
>
> Key: ARROW-18078
> URL: https://issues.apache.org/jira/browse/ARROW-18078
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, R
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Broken link in rendered site:
> https://arrow.apache.org/docs/r/articles/developers/install_details.html#using-the-r-package-with-libarrow-installed-as-a-system-package
> corresponding to the text *“Troubleshooting” section in the main installation 
> docs*
> https://github.com/apache/arrow/blob/master/r/vignettes/developers/install_details.Rmd#L121





[jira] [Updated] (ARROW-18078) [Docs][R] Broken link in R documentation

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18078:
-
Summary: [Docs][R] Broken link in R documentation  (was: Broken link in R 
documentation)

> [Docs][R] Broken link in R documentation
> 
>
> Key: ARROW-18078
> URL: https://issues.apache.org/jira/browse/ARROW-18078
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, R
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Broken link in rendered site:
> https://arrow.apache.org/docs/r/articles/developers/install_details.html#using-the-r-package-with-libarrow-installed-as-a-system-package
> corresponding to the text *“Troubleshooting” section in the main installation 
> docs*
> https://github.com/apache/arrow/blob/master/r/vignettes/developers/install_details.Rmd#L121





[jira] [Updated] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18076:
-
Summary: [Python] PyArrow cannot read from R2 (Cloudflare's S3)  (was: 
PyArrow cannot read from R2 (Cloudflare's S3))

> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> --
>
> Key: ARROW-18076
> URL: https://issues.apache.org/jira/browse/ARROW-18076
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Ubuntu 20
>Reporter: Vedant Roy
>Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get 
> the following stack trace:
> ```
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File 
> "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
>  line 446, in _sample_piece
> (_sample_piece pid=49818) batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in 
> _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
> pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
> pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
> Transferred a partial file
> ```
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446





[jira] [Updated] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18076:
-
Description: 
When using pyarrow to read parquet data (as part of the Ray project), I get the 
following stack trace:

{noformat}
(_sample_piece pid=49818) Traceback (most recent call last):
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
ray._raylet.execute_task
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
ray._raylet.execute_task
(_sample_piece pid=49818)   File 
"/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
 line 446, in _sample_piece
(_sample_piece pid=49818) batch = next(batches)
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
pyarrow.lib.pyarrow_internal_check_status
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
pyarrow.lib.check_status
(_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
Transferred a partial file
{noformat}

I do not get this error when using Amazon S3 for the exact same data.
The error is coming from this line:
https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446

  was:
When using pyarrow to read parquet data (as part of the Ray project), I get the 
following stack trace:

```
(_sample_piece pid=49818) Traceback (most recent call last):
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
ray._raylet.execute_task
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
ray._raylet.execute_task
(_sample_piece pid=49818)   File 
"/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
 line 446, in _sample_piece
(_sample_piece pid=49818) batch = next(batches)
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
pyarrow.lib.pyarrow_internal_check_status
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
pyarrow.lib.check_status
(_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
Transferred a partial file
```

I do not get this error when using Amazon S3 for the exact same data.
The error is coming from this line:
https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446


> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> --
>
> Key: ARROW-18076
> URL: https://issues.apache.org/jira/browse/ARROW-18076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Ubuntu 20
>Reporter: Vedant Roy
>Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get 
> the following stack trace:
> {noformat}
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File 
> "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
>  line 446, in _sample_piece
> (_sample_piece pid=49818) batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in 
> _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
> pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
> pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
> Transferred a partial file
> {noformat}
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446





[jira] [Updated] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-18076:
-
Component/s: Python

> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> --
>
> Key: ARROW-18076
> URL: https://issues.apache.org/jira/browse/ARROW-18076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Ubuntu 20
>Reporter: Vedant Roy
>Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get 
> the following stack trace:
> ```
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
> ray._raylet.execute_task
> (_sample_piece pid=49818)   File 
> "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
>  line 446, in _sample_piece
> (_sample_piece pid=49818) batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in 
> _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
> pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
> pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
> Transferred a partial file
> ```
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446





[jira] [Updated] (ARROW-18081) [Go] Add Scalar Boolean Functions

2022-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18081:
---
Labels: pull-request-available  (was: )

> [Go] Add Scalar Boolean Functions
> -
>
> Key: ARROW-18081
> URL: https://issues.apache.org/jira/browse/ARROW-18081
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-18081) [Go] Add Scalar Boolean Functions

2022-10-17 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-18081:
-

 Summary: [Go] Add Scalar Boolean Functions
 Key: ARROW-18081
 URL: https://issues.apache.org/jira/browse/ARROW-18081
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol
 Fix For: 11.0.0








[jira] [Updated] (ARROW-18080) [C++] Remove gcc <= 4.9 workarounds

2022-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18080:
---
Labels: pull-request-available  (was: )

> [C++] Remove gcc <= 4.9 workarounds
> ---
>
> Key: ARROW-18080
> URL: https://issues.apache.org/jira/browse/ARROW-18080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since we require gcc 7ish or greater now that we're on C++17





[jira] [Resolved] (ARROW-17909) [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 2: Encoding Structs and Lists

2022-10-17 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-17909.
-
Resolution: Fixed

Published at 
https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/

> [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 2: Encoding 
> Structs and Lists
> --
>
> Key: ARROW-17909
> URL: https://issues.apache.org/jira/browse/ARROW-17909
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17907) [Website] Blog about Arrow <--> Parquet translation and nesting

2022-10-17 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619093#comment-17619093
 ] 

Andrew Lamb commented on ARROW-17907:
-

All sub parts are complete and published

> [Website] Blog about Arrow <--> Parquet translation and nesting
> ---
>
> Key: ARROW-17907
> URL: https://issues.apache.org/jira/browse/ARROW-17907
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> @tustvold has spent a significant amount of time fixing the Rust 
> implementation of the parquet <--> arrow conversion logic for all the corner 
> cases of nulls, etc. 
>  
> During that process, he observed there was a relative lack of information on 
> the topic, so we would like to write some blog posts to remedy that and 
> explain the Arrow and Parquet formats.
>  
> The basic outline is:
> Part 1: Intro / Encoding Primitive Arrays in Arrow and Parquet
> Part 2: Encoding Structs and Lists  in Arrow and Parquet
> Part 3: Encoding Arbitrary Structs of Lists, Lists of Structs in Arrow and 
> Parquet 





[jira] [Resolved] (ARROW-17910) [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 3: Encoding Arbitrary Structs of Lists, Lists of Structs in Arrow and Parquet

2022-10-17 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-17910.
-
Resolution: Fixed

> [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 3: Encoding 
> Arbitrary Structs of Lists, Lists of Structs in Arrow and Parquet
> --
>
> Key: ARROW-17910
> URL: https://issues.apache.org/jira/browse/ARROW-17910
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Priority: Major
>






[jira] [Resolved] (ARROW-17907) [Website] Blog about Arrow <--> Parquet translation and nesting

2022-10-17 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-17907.
-
Resolution: Fixed

> [Website] Blog about Arrow <--> Parquet translation and nesting
> ---
>
> Key: ARROW-17907
> URL: https://issues.apache.org/jira/browse/ARROW-17907
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> @tustvold has spent a significant amount of time fixing the Rust 
> implementation of the parquet <--> arrow conversion logic for all the corner 
> cases of nulls, etc. 
>  
> During that process, he observed there was a relative lack of information on 
> the topic, so we would like to write some blog posts to remedy that and 
> explain the Arrow and Parquet formats.
>  
> The basic outline is:
> Part 1: Intro / Encoding Primitive Arrays in Arrow and Parquet
> Part 2: Encoding Structs and Lists  in Arrow and Parquet
> Part 3: Encoding Arbitrary Structs of Lists, Lists of Structs in Arrow and 
> Parquet 





[jira] [Commented] (ARROW-17910) [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 3: Encoding Arbitrary Structs of Lists, Lists of Structs in Arrow and Parquet

2022-10-17 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619091#comment-17619091
 ] 

Andrew Lamb commented on ARROW-17910:
-

Published at 
https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/

> [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 3: Encoding 
> Arbitrary Structs of Lists, Lists of Structs in Arrow and Parquet
> --
>
> Key: ARROW-17910
> URL: https://issues.apache.org/jira/browse/ARROW-17910
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Priority: Major
>






[jira] [Resolved] (ARROW-18068) [Dev][Archery][Crossbow] Comment bot only waits for task if link is not available

2022-10-17 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-18068.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14429
[https://github.com/apache/arrow/pull/14429]

> [Dev][Archery][Crossbow] Comment bot only waits for task if link is not 
> available
> -
>
> Key: ARROW-18068
> URL: https://issues.apache.org/jira/browse/ARROW-18068
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> With the change introduced on ARROW-18048 
> ([https://github.com/apache/arrow/pull/14412]) we are waiting the wait time 
> for each task that is triggered on the comment on the PR.
> The problem is that if we execute a group of tasks this will wait the defined 
> amount of seconds (60 as default) for each one of the tasks on the group.





[jira] [Assigned] (ARROW-18043) [R] Properly instantiate empty arrays of extension types in Table__from_schema

2022-10-17 Thread Dewey Dunnington (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington reassigned ARROW-18043:


Assignee: Dewey Dunnington

> [R] Properly instantiate empty arrays of extension types in Table__from_schema
> --
>
> Key: ARROW-18043
> URL: https://issues.apache.org/jira/browse/ARROW-18043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dewey Dunnington
>Priority: Major
>
> The PR for ARROW-12105 introduces the function Table__from_schema which 
> creates an empty Table from a Schema object.  Currently it can't handle 
> extension types, and instead just returns NULL type objects.





[jira] [Created] (ARROW-18080) [C++] Remove gcc <= 4.9 workarounds

2022-10-17 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-18080:
---

 Summary: [C++] Remove gcc <= 4.9 workarounds
 Key: ARROW-18080
 URL: https://issues.apache.org/jira/browse/ARROW-18080
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
Assignee: Neal Richardson


Since we require gcc 7ish or greater now that we're on C++17





[jira] [Updated] (ARROW-17849) [R][Docs] Document changes due to C++17 for centos-7 users

2022-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17849:
---
Labels: pull-request-available  (was: )

> [R][Docs] Document changes due to C++17 for centos-7 users
> --
>
> Key: ARROW-17849
> URL: https://issues.apache.org/jira/browse/ARROW-17849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With the switch to C++17 centos 7 users need to install and enable devtoolset 
> (and possibly change makevars) to be able to compile the R package, even when 
> using the libarrow binary (see [ARROW-17594]) but that they can use 
> INSTALL_opts = "--build" to get a binary package that is installable on a 
> centos machine WITHOUT dts -> offline build section. Centos 7 RSPM is an 
> alternative source for that binary package. 
> Also add messaging in configure or build_arrow_static.sh so that if someone 
> is trying to install from source with gcc 4.8, we tell them what they need to 
> do.
> This should be documented and noted in the release notes.





[jira] [Closed] (ARROW-16985) [R] Use centos-7 binary on Linux if curl/openssl not found

2022-10-17 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-16985.
---
Resolution: Won't Fix

No longer valid after the C++17 upgrade

> [R] Use centos-7 binary on Linux if curl/openssl not found
> --
>
> Key: ARROW-16985
> URL: https://issues.apache.org/jira/browse/ARROW-16985
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, R
>Reporter: Neal Richardson
>Priority: Major
>
> Followup to ARROW-16752. This should work; we'll need to add an env var to 
> control whether r_docker_configure.sh installs curl/openssl to test this.





[jira] [Assigned] (ARROW-15538) [C++] Create mapping from Substrait "standard functions" to Arrow equivalents

2022-10-17 Thread Tom Drabas (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Drabas reassigned ARROW-15538:
--

Assignee: Tom Drabas

> [C++] Create mapping from Substrait "standard functions" to Arrow equivalents
> -
>
> Key: ARROW-15538
> URL: https://issues.apache.org/jira/browse/ARROW-15538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Tom Drabas
>Priority: Major
>  Labels: pull-request-available, substrait
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Substrait has a number of "stock" functions defined here: 
> https://github.com/substrait-io/substrait/tree/main/extensions
> This is basically a set of standard extensions.
> We should map these functions to the equivalent Arrow functions.
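As a rough illustration, such a mapping could start as a simple lookup table from Substrait standard function names to Arrow compute function names. The entries below are illustrative assumptions (Arrow's actual Substrait consumer maintains its own mapping and covers far more functions):

```python
# Sketch of a Substrait -> Arrow compute function name mapping.
# The entries are illustrative assumptions, not Arrow's real registry.
SUBSTRAIT_TO_ARROW = {
    "add": "add_checked",        # checked variants raise on overflow
    "subtract": "subtract_checked",
    "multiply": "multiply_checked",
    "lt": "less",
    "gt": "greater",
}


def arrow_function_for(substrait_name):
    """Look up the Arrow compute function for a Substrait standard function."""
    try:
        return SUBSTRAIT_TO_ARROW[substrait_name]
    except KeyError:
        raise NotImplementedError(f"no Arrow mapping for {substrait_name!r}")
```

A consumer would apply this lookup while translating each expression in a Substrait plan, failing fast on functions it cannot map.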





[jira] [Commented] (ARROW-17048) [Python] Improving Classes and Methods Docstrings - Type Classes

2022-10-17 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619036#comment-17619036
 ] 

Apache Arrow JIRA Bot commented on ARROW-17048:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [Python] Improving Classes and Methods Docstrings - Type Classes
> 
>
> Key: ARROW-17048
> URL: https://issues.apache.org/jira/browse/ARROW-17048
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>
> Continuation of the initiative aimed at improving methods and classes 
> docstrings, especially from the point of view of ensuring they have an 
> {{Examples}} section.
> This is an umbrella issue for tickets that cover specific [type 
> classes|https://arrow.apache.org/docs/python/api/datatypes.html#type-classes] 
> and could be worked on after 
> https://issues.apache.org/jira/browse/ARROW-16331.





[jira] [Assigned] (ARROW-12084) [C++][Compute] Add remainder and quotient compute::Function

2022-10-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-12084:
-

Assignee: (was: Eduardo Ponce)

> [C++][Compute] Add remainder and quotient compute::Function
> ---
>
> Key: ARROW-12084
> URL: https://issues.apache.org/jira/browse/ARROW-12084
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> In addition to {{divide}} which returns only the quotient, it'd be useful to 
> have a function which returns both quotient and remainder (these are 
> efficient to compute simultaneously), probably as a {{struct<quotient: T, remainder: T>}}. 
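The simultaneous computation the description refers to can be seen with Python's built-in divmod, which returns quotient and remainder as a pair, the scalar analogue of the proposed struct-returning kernel:

```python
# divmod computes quotient and remainder together, the scalar analogue
# of a kernel returning struct<quotient, remainder>.
q, r = divmod(7, 3)
assert (q, r) == (2, 1)
# The invariant such a kernel would preserve:
# dividend == quotient * divisor + remainder
assert 7 == q * 3 + r
```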





[jira] [Commented] (ARROW-16970) [C++][Gandiva] Implement hive functions encode and decode

2022-10-17 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619035#comment-17619035
 ] 

Apache Arrow JIRA Bot commented on ARROW-16970:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Gandiva] Implement hive functions encode and decode
> -
>
> Key: ARROW-16970
> URL: https://issues.apache.org/jira/browse/ARROW-16970
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Sahaj Gupta
>Assignee: Sahaj Gupta
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> ENCODE(UTF-8 -> UTF-16BE and UTF-8 -> UTF-16LE) and DECODE(UTF-16BE -> UTF-8 
> and UTF-16LE -> UTF-8)
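For reference, the requested conversions round-trip with standard codecs; a minimal Python sketch of the ENCODE/DECODE semantics:

```python
# ENCODE: UTF-8 text to UTF-16BE / UTF-16LE bytes; DECODE reverses it.
s = "Arrow"
be = s.encode("utf-16-be")   # big-endian, no BOM
le = s.encode("utf-16-le")   # little-endian, no BOM
assert be.decode("utf-16-be") == s
assert le.decode("utf-16-le") == s
assert be != le              # byte order differs between the two encodings
```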





[jira] [Commented] (ARROW-12084) [C++][Compute] Add remainder and quotient compute::Function

2022-10-17 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619037#comment-17619037
 ] 

Apache Arrow JIRA Bot commented on ARROW-12084:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][Compute] Add remainder and quotient compute::Function
> ---
>
> Key: ARROW-12084
> URL: https://issues.apache.org/jira/browse/ARROW-12084
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> In addition to {{divide}} which returns only the quotient, it'd be useful to 
> have a function which returns both quotient and remainder (these are 
> efficient to compute simultaneously), probably as a {{struct<quotient: T, remainder: T>}}. 





[jira] [Assigned] (ARROW-16970) [C++][Gandiva] Implement hive functions encode and decode

2022-10-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-16970:
-

Assignee: (was: Sahaj Gupta)

> [C++][Gandiva] Implement hive functions encode and decode
> -
>
> Key: ARROW-16970
> URL: https://issues.apache.org/jira/browse/ARROW-16970
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Sahaj Gupta
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> ENCODE(UTF-8 -> UTF-16BE and UTF-8 -> UTF-16LE) and DECODE(UTF-16BE -> UTF-8 
> and UTF-16LE -> UTF-8)





[jira] [Assigned] (ARROW-17048) [Python] Improving Classes and Methods Docstrings - Type Classes

2022-10-17 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-17048:
-

Assignee: (was: Alenka Frim)

> [Python] Improving Classes and Methods Docstrings - Type Classes
> 
>
> Key: ARROW-17048
> URL: https://issues.apache.org/jira/browse/ARROW-17048
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Priority: Major
>
> Continuation of the initiative aimed at improving methods and classes 
> docstrings, especially from the point of view of ensuring they have an 
> {{Examples}} section.
> This is an umbrella issue for tickets that cover specific [type 
> classes|https://arrow.apache.org/docs/python/api/datatypes.html#type-classes] 
> and could be worked on after 
> https://issues.apache.org/jira/browse/ARROW-16331.





[jira] [Updated] (ARROW-15538) [C++] Create mapping from Substrait "standard functions" to Arrow equivalents

2022-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15538:
---
Labels: pull-request-available substrait  (was: substrait)

> [C++] Create mapping from Substrait "standard functions" to Arrow equivalents
> -
>
> Key: ARROW-15538
> URL: https://issues.apache.org/jira/browse/ARROW-15538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available, substrait
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Substrait has a number of "stock" functions defined here: 
> https://github.com/substrait-io/substrait/tree/main/extensions
> This is basically a set of standard extensions.
> We should map these functions to the equivalent Arrow functions.





[jira] [Resolved] (ARROW-17991) [Python] pyarrow.dataset IPC format does not support compression

2022-10-17 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-17991.
--
Resolution: Fixed

Issue resolved by pull request 14414
[https://github.com/apache/arrow/pull/14414]

> [Python] pyarrow.dataset IPC format does not support compression
> 
>
> Key: ARROW-17991
> URL: https://issues.apache.org/jira/browse/ARROW-17991
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joost Hoozemans
>Assignee: Joost Hoozemans
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> When trying to write an IPC dataset using pyarrow.dataset, it is not possible 
> to pass a compression argument:
> Trying to pass a pyarrow.ipc.IpcWriteOptions object:
> >>> ds.write_dataset(f, "./thing.arrow", format=ds.IpcFileFormat(), 
> >>> file_options=ipc.IpcWriteOptions(compression='lz4'))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/joost/.cache/pypoetry/virtualenvs/datalogistik-rL_l_suP-py3.8/lib/python3.8/site-packages/pyarrow/dataset.py",
>  line 940, in write_dataset
>     if format != file_options.format:
> AttributeError: 'pyarrow.lib.IpcWriteOptions' object has no attribute 'format'
>  
> Alternatively, pyarrow.dataset.IpcFileFormat().make_write_options() does not 
> support a compression parameter





[jira] [Created] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18079:


 Summary: [R] Performance regressions after ARROW-12105
 Key: ARROW-18079
 URL: https://issues.apache.org/jira/browse/ARROW-18079
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


The functionality implemented in ARROW-12105 introduced some performance 
regressions that we should sort out before the release.





[jira] [Updated] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-18079:
-
Fix Version/s: 10.0.0

> [R] Performance regressions after ARROW-12105
> -
>
> Key: ARROW-18079
> URL: https://issues.apache.org/jira/browse/ARROW-18079
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Priority: Blocker
> Fix For: 10.0.0
>
>
> The functionality implemented in ARROW-12105 introduced some performance 
> regressions that we should sort out before the release.





[jira] [Resolved] (ARROW-18069) Suggest use force with lease where possible

2022-10-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18069.

Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14430
[https://github.com/apache/arrow/pull/14430]

> Suggest use force with lease where possible
> ---
>
> Key: ARROW-18069
> URL: https://issues.apache.org/jira/browse/ARROW-18069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>  Labels: documentation, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-18069) Suggest use force with lease where possible

2022-10-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18069:
---
Fix Version/s: 10.0.0
   (was: 11.0.0)

> Suggest use force with lease where possible
> ---
>
> Key: ARROW-18069
> URL: https://issues.apache.org/jira/browse/ARROW-18069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Minor
>  Labels: documentation, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Comment Edited] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}

2022-10-17 Thread Li Jin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618950#comment-17618950
 ] 

Li Jin edited comment on ARROW-18063 at 10/17/22 3:11 PM:
--

{quote}It might be slightly nicer to throw an error when setting the default 
named table provider if it has already been set. There are more complex 
alternatives such as a named table provider registry or a chain of named table 
providers but I'm not sure they are needed in this case.
{quote}
I think either overriding or raising an error is fine. In practice I don't see our 
application needing to invoke the initialization of custom registration more 
than once.

 
{quote}Another alternative, which might be a more long term solution, is to 
create a new Substrait extension which defines a new {{read_type}} (e.g. 
{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL).

We would then need to make it possible to construct custom sources from 
{{ExtensionTable}} though which probably puts us in roughly the same boat :). 
We would need an {{ExtensionTableProvider}} and we would probably want the 
default to be configurable.
{quote}
I have the same thinking as well. Long term we should allow users to register a 
custom ExtensionTableProvider too, ideally in a way similar to how 
ExecFactoryRegistry and NamedTableProvider are extended.


was (Author: icexelloss):
>It might be slightly nicer to throw an error when setting the default named 
>table provider if it has already been set. There are more complex alternatives 
>such as a named table provider registry or a chain of named table providers 
>but I'm not sure they are needed in this case.

I think either override or raise error is fine. In practice I don't see our 
application would need to invoke the initialization of custom registration more 
than once.

 

>Another alternative, which might be a more long term solution, is to create a 
>new Substrait extension which defines a new {{read_type}} (e.g. 
>{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL).

>We would then need to make it possible to construct custom sources from 
>{{ExtensionTable}} though which probably puts us in roughly the same boat :). 
>We would need an {{ExtensionTableProvider}} and we would probably want the 
>default to be configurable.

I have the same thinking as well. Long term we should allow user to register 
custom ExtensionTableProvider as well and ideally with the similar way of how 
to extend ExecFactoryRegistry and NamedTableProvider.

> [C++][Python] Custom streaming data providers in {{run_query}}
> --
>
> Key: ARROW-18063
> URL: https://issues.apache.org/jira/browse/ARROW-18063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> [Mailing list 
> thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than 
> pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only 
> available in c++
> - The python {{run_query}} function requires tables as input and cannot 
> accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom cython is not preferred
> Some potential solutions:
> - Use ExecuteSerializedPlan() directly usable from c++ so that construction 
> of data sources need not be handled in python. Passing a buffer from 
> python/ibis down to C++ is much simpler and can be navigated without writing 
> cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} 
> into a registry so that data source factories can be added from c++ then 
> referenced by name from python
> - Extend {{run_query}} to support non-Table sources and require the user to 
> write a python mapping from URLs to {{pa.RecordBatchReader}}





[jira] [Comment Edited] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}

2022-10-17 Thread Li Jin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618950#comment-17618950
 ] 

Li Jin edited comment on ARROW-18063 at 10/17/22 3:10 PM:
--

>It might be slightly nicer to throw an error when setting the default named 
>table provider if it has already been set. There are more complex alternatives 
>such as a named table provider registry or a chain of named table providers 
>but I'm not sure they are needed in this case.

I think either overriding or raising an error is fine. In practice I don't see our 
application needing to invoke the initialization of custom registration more 
than once.

 

>Another alternative, which might be a more long term solution, is to create a 
>new Substrait extension which defines a new {{read_type}} (e.g. 
>{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL).

>We would then need to make it possible to construct custom sources from 
>{{ExtensionTable}} though which probably puts us in roughly the same boat :). 
>We would need an {{ExtensionTableProvider}} and we would probably want the 
>default to be configurable.

I have the same thinking as well. Long term we should allow users to register a 
custom ExtensionTableProvider too, ideally in a way similar to how 
ExecFactoryRegistry and NamedTableProvider are extended.


was (Author: icexelloss):
>It might be slightly nicer to throw an error when setting the default named 
>table provider if it has already been set. There are more complex alternatives 
>such as a named table provider registry or a chain of named table providers 
>but I'm not sure they are needed in this case.

I think either override or raise error is fine. In practice I don't see our 
application would need to invoke the initialization of custom registration more 
than once.

 

>Another alternative, which might be a more long term solution, is to create a 
>new Substrait extension which defines a new {{read_type}} (e.g. 
>{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL).

We would then need to make it possible to construct custom sources from 
{{ExtensionTable}} though which probably puts us in roughly the same boat :). 
We would need an {{ExtensionTableProvider}} and we would probably want the 
default to be configurable.

I have the same thinking as well. Long term we should allow user to register 
custom ExtensionTableProvider as well and ideally with the similar way of how 
to extend ExecFactoryRegistry and NamedTableProvider.

> [C++][Python] Custom streaming data providers in {{run_query}}
> --
>
> Key: ARROW-18063
> URL: https://issues.apache.org/jira/browse/ARROW-18063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> [Mailing list 
> thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than 
> pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only 
> available in c++
> - The python {{run_query}} function requires tables as input and cannot 
> accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom cython is not preferred
> Some potential solutions:
> - Use ExecuteSerializedPlan() directly usable from c++ so that construction 
> of data sources need not be handled in python. Passing a buffer from 
> python/ibis down to C++ is much simpler and can be navigated without writing 
> cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} 
> into a registry so that data source factories can be added from c++ then 
> referenced by name from python
> - Extend {{run_query}} to support non-Table sources and require the user to 
> write a python mapping from URLs to {{pa.RecordBatchReader}}





[jira] [Commented] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}

2022-10-17 Thread Li Jin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618950#comment-17618950
 ] 

Li Jin commented on ARROW-18063:


>It might be slightly nicer to throw an error when setting the default named 
>table provider if it has already been set. There are more complex alternatives 
>such as a named table provider registry or a chain of named table providers 
>but I'm not sure they are needed in this case.

I think either overriding or raising an error is fine. In practice I don't see our 
application needing to invoke the initialization of custom registration more 
than once.

 

>Another alternative, which might be a more long term solution, is to create a 
>new Substrait extension which defines a new {{read_type}} (e.g. 
>{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL).

We would then need to make it possible to construct custom sources from 
{{ExtensionTable}} though which probably puts us in roughly the same boat :). 
We would need an {{ExtensionTableProvider}} and we would probably want the 
default to be configurable.

I have the same thinking as well. Long term we should allow users to register a 
custom ExtensionTableProvider too, ideally in a way similar to how 
ExecFactoryRegistry and NamedTableProvider are extended.

> [C++][Python] Custom streaming data providers in {{run_query}}
> --
>
> Key: ARROW-18063
> URL: https://issues.apache.org/jira/browse/ARROW-18063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> [Mailing list 
> thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than 
> pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only 
> available in c++
> - The python {{run_query}} function requires tables as input and cannot 
> accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom cython is not preferred
> Some potential solutions:
> - Use ExecuteSerializedPlan() directly usable from c++ so that construction 
> of data sources need not be handled in python. Passing a buffer from 
> python/ibis down to C++ is much simpler and can be navigated without writing 
> cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} 
> into a registry so that data source factories can be added from c++ then 
> referenced by name from python
> - Extend {{run_query}} to support non-Table sources and require the user to 
> write a python mapping from URLs to {{pa.RecordBatchReader}}





[jira] [Closed] (ARROW-12616) [Python] S3FileSystem OSError when writing to already created bucket

2022-10-17 Thread Samuel Sanders (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Samuel Sanders closed ARROW-12616.
--
Fix Version/s: 9.0.0
 Assignee: Samuel Sanders
   Resolution: Works for Me

The implementation appears to be complete when you use Parquet format version >= 
2.0. I did not test all configurations, only pyarrow 9.0.0 with version 2.6.

> [Python] S3FileSystem OSError when writing to already created bucket
> 
>
> Key: ARROW-12616
> URL: https://issues.apache.org/jira/browse/ARROW-12616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0, 8.0.0
>Reporter: Samuel Sanders
>Assignee: Samuel Sanders
>Priority: Major
> Fix For: 9.0.0
>
>
> When calling parquet.write_to_dataset with an S3FileSystem, the following 
> error occurs if the bucket already exists:
> OSError: When creating bucket 'bucket-name': AWS Error [code 100]: Unable to 
> parse ExceptionName: InvalidLocationConstraint Message: The specified 
> location-constraint is not valid
> The S3FileSystem instance was generated both from a URI and from a region, but 
> both returned the same error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12616) [Python] S3FileSystem OSError when writing to already created bucket

2022-10-17 Thread Samuel Sanders (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618938#comment-17618938
 ] 

Samuel Sanders commented on ARROW-12616:


This is no longer an issue when using version='2.6' in pyarrow version 9.0.0

> [Python] S3FileSystem OSError when writing to already created bucket
> 
>
> Key: ARROW-12616
> URL: https://issues.apache.org/jira/browse/ARROW-12616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0, 8.0.0
>Reporter: Samuel Sanders
>Priority: Major
>
> When calling parquet.write_to_dataset with an S3FileSystem, the following 
> error occurs if the bucket already exists:
> OSError: When creating bucket 'bucket-name': AWS Error [code 100]: Unable to 
> parse ExceptionName: InvalidLocationConstraint Message: The specified 
> location-constraint is not valid
> The S3FileSystem instance was generated both from a URI and from a region, but 
> both returned the same error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18078) Broken link in R documentation

2022-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-18078:
---
Labels: pull-request-available  (was: )

> Broken link in R documentation
> --
>
> Key: ARROW-18078
> URL: https://issues.apache.org/jira/browse/ARROW-18078
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, R
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Broken link in rendered site:
> https://arrow.apache.org/docs/r/articles/developers/install_details.html#using-the-r-package-with-libarrow-installed-as-a-system-package
> corresponding to the text *“Troubleshooting” section in the main installation 
> docs*
> https://github.com/apache/arrow/blob/master/r/vignettes/developers/install_details.Rmd#L121



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17460) [R] Don't warn if the new UDF I'm registering is the same as the existing one

2022-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17460:
---
Labels: pull-request-available  (was: )

> [R] Don't warn if the new UDF I'm registering is the same as the existing one
> -
>
> Key: ARROW-17460
> URL: https://issues.apache.org/jira/browse/ARROW-17460
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Neal Richardson
>Assignee: Dewey Dunnington
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When you register a UDF, you get a warning if there already is a function 
> with that name. This can be helpful, but it's a nuisance if you're running a 
> script more than once interactively. I encountered this while working in an R 
> Markdown document for a demo: the second time I rendered the doc in my R 
> session, the warning was printed in the rendered document. 
> It would be nice if we could determine that the UDF being registered is the 
> same as the one that's already there and not warn in that case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18078) Broken link in R documentation

2022-10-17 Thread Benson Muite (Jira)
Benson Muite created ARROW-18078:


 Summary: Broken link in R documentation
 Key: ARROW-18078
 URL: https://issues.apache.org/jira/browse/ARROW-18078
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, R
Reporter: Benson Muite
Assignee: Benson Muite


Broken link in rendered site:
https://arrow.apache.org/docs/r/articles/developers/install_details.html#using-the-r-package-with-libarrow-installed-as-a-system-package
corresponding to the text *“Troubleshooting” section in the main installation 
docs*
https://github.com/apache/arrow/blob/master/r/vignettes/developers/install_details.Rmd#L121



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15904) [C++] Support rolling backwards and forwards with temporal arithmetic

2022-10-17 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-15904:
--
Priority: Major  (was: Blocker)

> [C++] Support rolling backwards and forwards with temporal arithmetic
> -
>
> Key: ARROW-15904
> URL: https://issues.apache.org/jira/browse/ARROW-15904
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Rok Mihevc
>Priority: Major
>
> Original description in ARROW-11090: 
> "This should also cover the ability to do with and without rollback (so have 
> the ability to do e.g. 2021-03-30 minus 1 month and either get a null back, 
> or 2021-02-28), plus the ability to specify whether to roll back to the first 
> or last, and whether to preserve or reset the time.)"
> For example, in R, lubridate has the following functionality:
> * {{rollbackward()}} or {{rollback()}} which changes a date to the last day 
> of the previous month or to the first day of the current month
> * {{rollforward()}} which rolls to the last day of the current month or to 
> the first day of the next month.
> * all of the above also offer the option to preserve hms (hours, minutes and 
> seconds) when rolling. 
> This functionality underpins functions such as {{%m-%}} and {{%m+%}} which 
> are used to add or subtract months to a date without exceeding the last day 
> of the new month.
> {code:r}
> library(lubridate)
> jan <- ymd_hms("2010-01-31 03:04:05")
> jan + months(1:3) # Feb 31 and April 31 returned as NA
> #> [1] NA                        "2010-03-31 03:04:05 UTC"
> #> [3] NA
> jan %m+% months(1:3) # No rollover
> #> [1] "2010-02-28 03:04:05 UTC" "2010-03-31 03:04:05 UTC"
> #> [3] "2010-04-30 03:04:05 UTC"
> leap <- ymd("2012-02-29")
> leap
> #> [1] "2012-02-29 UTC"
> leap %m+% years(1)
> #> [1] "2013-02-28"
> leap %m+% years(-1)
> #> [1] "2011-02-28"
> leap %m-% years(1)
> #> [1] "2011-02-28"
> x <- ymd_hms("2019-01-29 01:02:03")
> add_with_rollback(x, months(1))
> #> [1] "2019-02-28 01:02:03 UTC"
> add_with_rollback(x, months(1), preserve_hms = FALSE)
> #> [1] "2019-02-28 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE)
> #> [1] "2019-03-01 01:02:03 UTC"
> add_with_rollback(x, months(1), roll_to_first = TRUE, preserve_hms = FALSE)
> #> [1] "2019-03-01 UTC"
> {code}
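For readers following along in Python, the clamp-to-last-valid-day behaviour that lubridate's {{%m+%}} and {{add_with_rollback()}} implement can be sketched with the standard library (the helper name and signature below are hypothetical, not an Arrow API):

```python
import calendar
import datetime as dt

def add_months_with_rollback(d, months, roll_to_first=False):
    # Hypothetical helper mirroring lubridate's %m+%: instead of letting
    # "Jan 31 + 1 month" overflow into March, clamp the day to the last
    # valid day of the target month (or roll to the first of the next).
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    if d.day > last_day:
        if roll_to_first:
            # roll forward to the first day of the following month instead
            return add_months_with_rollback(d.replace(day=1), months + 1)
        return d.replace(year=year, month=month, day=last_day)
    return d.replace(year=year, month=month)

print(add_months_with_rollback(dt.date(2019, 1, 29), 1))   # 2019-02-28
print(add_months_with_rollback(dt.date(2012, 2, 29), 12))  # 2013-02-28
```

The same clamping rule reproduces the leap-year cases quoted above: 2012-02-29 plus or minus a year lands on Feb 28 rather than returning null.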



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17932) [C++] Implement streaming RecordBatchReader for JSON

2022-10-17 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-17932:
--
Priority: Major  (was: Blocker)

> [C++] Implement streaming RecordBatchReader for JSON
> 
>
> Key: ARROW-17932
> URL: https://issues.apache.org/jira/browse/ARROW-17932
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Ben Harkins
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We don't currently support incremental RecordBatch reading from JSON streams, 
> which is needed to properly implement JSON support in Dataset. The existing 
> CSV StreamingReader API can be used as a model.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-18077) [CI][C++] Builds with ORC enabled fail on ubuntu 18.04 jobs

2022-10-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido closed ARROW-18077.
-
Fix Version/s: (was: 10.0.0)
   Resolution: Duplicate

Sorry, it seems this was merged some hours ago and should be fixed on the next 
nightlies round

> [CI][C++] Builds with ORC enabled fail on ubuntu 18.04 jobs
> ---
>
> Key: ARROW-18077
> URL: https://issues.apache.org/jira/browse/ARROW-18077
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
>
> Currently some of our nightlies are broken with the following error:
> {code:java}
>  -- Building Abseil-cpp from source
> -- Building gRPC from source
> -- Found hdfs.h at: /arrow/cpp/thirdparty/hadoop/include/hdfs.h
> -- Building Apache ORC from source
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:4341 
> (target_link_libraries):
>   Cannot specify link libraries for target "orc::liborc" which is not built
>   by this project.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:183 (build_orc)
>   cmake_modules/ThirdpartyToolchain.cmake:274 (build_dependency)
>   cmake_modules/ThirdpartyToolchain.cmake:4357 (resolve_dependency)
>   CMakeLists.txt:496 (include)
> -- Configuring incomplete, errors occurred! {code}
> It seems to be related to https://issues.apache.org/jira/browse/ARROW-17817 
> ([https://github.com/apache/arrow/pull/14208)] as the first appearance of the 
> issue was once the above was merged.
> The full list of changes on the first failure appearance:
> [https://github.com/apache/arrow/compare/99b40926c72bcedc056ce2c869ada4d4d02fc509...0b86e40622f7153d64b36b4e65e0c0ace15d6ffa]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18077) [CI][C++] Builds with ORC enabled fail on ubuntu 18.04 jobs

2022-10-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-18077:
--
Summary: [CI][C++] Builds with ORC enabled fail on ubuntu 18.04 jobs  (was: 
[CI][C++] ubuntu 18.04 builds with ORC enabled fail)

> [CI][C++] Builds with ORC enabled fail on ubuntu 18.04 jobs
> ---
>
> Key: ARROW-18077
> URL: https://issues.apache.org/jira/browse/ARROW-18077
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Raúl Cumplido
>Priority: Blocker
>  Labels: Nightly
> Fix For: 10.0.0
>
>
> Currently some of our nightlies are broken with the following error:
> {code:java}
>  -- Building Abseil-cpp from source
> -- Building gRPC from source
> -- Found hdfs.h at: /arrow/cpp/thirdparty/hadoop/include/hdfs.h
> -- Building Apache ORC from source
> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:4341 
> (target_link_libraries):
>   Cannot specify link libraries for target "orc::liborc" which is not built
>   by this project.
> Call Stack (most recent call first):
>   cmake_modules/ThirdpartyToolchain.cmake:183 (build_orc)
>   cmake_modules/ThirdpartyToolchain.cmake:274 (build_dependency)
>   cmake_modules/ThirdpartyToolchain.cmake:4357 (resolve_dependency)
>   CMakeLists.txt:496 (include)
> -- Configuring incomplete, errors occurred! {code}
> It seems to be related to https://issues.apache.org/jira/browse/ARROW-17817 
> ([https://github.com/apache/arrow/pull/14208)] as the first appearance of the 
> issue was once the above was merged.
> The full list of changes on the first failure appearance:
> [https://github.com/apache/arrow/compare/99b40926c72bcedc056ce2c869ada4d4d02fc509...0b86e40622f7153d64b36b4e65e0c0ace15d6ffa]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18077) [CI][C++] ubuntu 18.04 builds with ORC enabled fail

2022-10-17 Thread Jira
Raúl Cumplido created ARROW-18077:
-

 Summary: [CI][C++] ubuntu 18.04 builds with ORC enabled fail
 Key: ARROW-18077
 URL: https://issues.apache.org/jira/browse/ARROW-18077
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Raúl Cumplido
 Fix For: 10.0.0


Currently some of our nightlies are broken with the following error:
{code:java}
 -- Building Abseil-cpp from source
-- Building gRPC from source
-- Found hdfs.h at: /arrow/cpp/thirdparty/hadoop/include/hdfs.h
-- Building Apache ORC from source
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:4341 
(target_link_libraries):
  Cannot specify link libraries for target "orc::liborc" which is not built
  by this project.
Call Stack (most recent call first):
  cmake_modules/ThirdpartyToolchain.cmake:183 (build_orc)
  cmake_modules/ThirdpartyToolchain.cmake:274 (build_dependency)
  cmake_modules/ThirdpartyToolchain.cmake:4357 (resolve_dependency)
  CMakeLists.txt:496 (include)
-- Configuring incomplete, errors occurred! {code}
It seems to be related to https://issues.apache.org/jira/browse/ARROW-17817 
([https://github.com/apache/arrow/pull/14208)] as the first appearance of the 
issue was once the above was merged.

The full list of changes on the first failure appearance:

[https://github.com/apache/arrow/compare/99b40926c72bcedc056ce2c869ada4d4d02fc509...0b86e40622f7153d64b36b4e65e0c0ace15d6ffa]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17849) [R][Docs] Document changes due to C++17 for centos-7 users

2022-10-17 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-17849:
---

Assignee: Neal Richardson  (was: Benson Muite)

> [R][Docs] Document changes due to C++17 for centos-7 users
> --
>
> Key: ARROW-17849
> URL: https://issues.apache.org/jira/browse/ARROW-17849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Neal Richardson
>Priority: Blocker
> Fix For: 10.0.0
>
>
> With the switch to C++17 centos 7 users need to install and enable devtoolset 
> (and possibly change makevars) to be able to compile the R package, even when 
> using the libarrow binary (see [ARROW-17594]) but that they can use 
> INSTALL_opts = "--build" to get a binary package that is installable on a 
> centos machine WITHOUT dts -> offline build section. Centos 7 RSPM is an 
> alternative source for that binary package. 
> Also add messaging in configure or build_arrow_static.sh so that if someone 
> is trying to install from source with gcc 4.8, we tell them what they need to 
> do.
> This should be documented and noted in the release notes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17849) [R][Docs] Document changes due to C++17 for centos-7 users

2022-10-17 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618901#comment-17618901
 ] 

Neal Richardson commented on ARROW-17849:
-

Thanks [~baksmj]. This issue is actually about something slightly different 
than that, so I'm going to take it. 

> [R][Docs] Document changes due to C++17 for centos-7 users
> --
>
> Key: ARROW-17849
> URL: https://issues.apache.org/jira/browse/ARROW-17849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Benson Muite
>Priority: Blocker
> Fix For: 10.0.0
>
>
> With the switch to C++17 centos 7 users need to install and enable devtoolset 
> (and possibly change makevars) to be able to compile the R package, even when 
> using the libarrow binary (see [ARROW-17594]) but that they can use 
> INSTALL_opts = "--build" to get a binary package that is installable on a 
> centos machine WITHOUT dts -> offline build section. Centos 7 RSPM is an 
> alternative source for that binary package. 
> Also add messaging in configure or build_arrow_static.sh so that if someone 
> is trying to install from source with gcc 4.8, we tell them what they need to 
> do.
> This should be documented and noted in the release notes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17849) [R][Docs] Document changes due to C++17 for centos-7 users

2022-10-17 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618851#comment-17618851
 ] 

Benson Muite commented on ARROW-17849:
--

Rough steps to get R and a minimal Arrow build. You would probably also want a 
recent version of Python and to run the tests.

{code:bash}
yum install -y centos-release-scl curl diffutils gcc-c++ libcurl-devel make 
openssl-devel wget which
yum install -y devtoolset-8
source scl_source enable devtoolset-8
yum install -y epel-release || yum install -y 
https://dl.fedoraproject.org/pub/epel/epel-release-latest-$(cut -d: -f5 
/etc/system-release-cpe | cut -d. -f1).noarch.rpm
export R_VERSION=4.2.1
curl -O https://cran.rstudio.com/src/base/R-4/R-${R_VERSION}.tar.gz
tar -xzvf R-${R_VERSION}.tar.gz
yum install readline-devel
./configure --prefix=/opt/R/${R_VERSION} --enable-memory-profiling 
--enable-R-shlib --with-blas --with-lapack
yum install x11-devel
yum install xorg-x11-server-Xorg xorg-x11-xauth xorg-x11-apps
yum install libX11-devel xorg-x11-server-devel
yum install libXt-devel
yum install libjpeg-devel libpng-devel libtiff-devel libcairo-devel
yum install cairo-devel
yum install perl
wget -qO- "https://yihui.org/tinytex/install-bin-unix.sh" | sh
cd R-${R_VERSION}
yum install pandoc texinfo
./configure --prefix=/opt/R/${R_VERSION} --enable-memory-profiling 
--enable-R-shlib --with-blas --with-lapack
tlmgr install grfext
yum install java-11-openjdk-devel
make
make check
make install
make install-pdf
make install-info
tlmgr install texi2dvi
yum install texinfo-tex
export PATH=$PATH:/opt/R/4.2.1/bin/
cd ..
git clone https://github.com/apache/arrow
wget 
https://github.com/Kitware/CMake/releases/download/v3.24.2/cmake-3.24.2.tar.gz
tar -xf cmake-3.24.2.tar.gz 
cd cmake-3.24.2
./configure 
make
make -j 4
make install
make test
cd ..
cd arrow
cd ci
cd scripts/
bash install_sccache.sh unknown-linux-musl /usr/local/bin
cd ..
cd ..
bash r/inst/build_arrow_static.sh 
R --vanilla -e 
"install.packages('assertthat',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('bit64',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('glue',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('purrr',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('R6',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('rlang',repos='http://cran.us.r-project.org')"
R --vanilla -e 
"install.packages('tidyselect',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('vctrs',repos='http://cran.us.r-project.org')"
R --vanilla -e "install.packages('styler',repos='http://cran.us.r-project.org')"
{code}


> [R][Docs] Document changes due to C++17 for centos-7 users
> --
>
> Key: ARROW-17849
> URL: https://issues.apache.org/jira/browse/ARROW-17849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Benson Muite
>Priority: Blocker
> Fix For: 10.0.0
>
>
> With the switch to C++17 centos 7 users need to install and enable devtoolset 
> (and possibly change makevars) to be able to compile the R package, even when 
> using the libarrow binary (see [ARROW-17594]) but that they can use 
> INSTALL_opts = "--build" to get a binary package that is installable on a 
> centos machine WITHOUT dts -> offline build section. Centos 7 RSPM is an 
> alternative source for that binary package. 
> Also add messaging in configure or build_arrow_static.sh so that if someone 
> is trying to install from source with gcc 4.8, we tell them what they need to 
> do.
> This should be documented and noted in the release notes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18076) PyArrow cannot read from R2 (Cloudflare's S3)

2022-10-17 Thread Vedant Roy (Jira)
Vedant Roy created ARROW-18076:
--

 Summary: PyArrow cannot read from R2 (Cloudflare's S3)
 Key: ARROW-18076
 URL: https://issues.apache.org/jira/browse/ARROW-18076
 Project: Apache Arrow
  Issue Type: Bug
 Environment: Ubuntu 20
Reporter: Vedant Roy


When using pyarrow to read parquet data (as part of the Ray project), I get the 
following stack trace:

```
(_sample_piece pid=49818) Traceback (most recent call last):
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in 
ray._raylet.execute_task
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in 
ray._raylet.execute_task
(_sample_piece pid=49818)   File 
"/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py",
 line 446, in _sample_piece
(_sample_piece pid=49818) batch = next(batches)
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in 
pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in 
pyarrow.lib.pyarrow_internal_check_status
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in 
pyarrow.lib.check_status
(_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, 
Transferred a partial file
```

I do not get this error when using Amazon S3 for the exact same data.
The error is coming from this line:
https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17991) [Python] pyarrow.dataset IPC format does not support compression

2022-10-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17991:
---
Fix Version/s: 11.0.0

> [Python] pyarrow.dataset IPC format does not support compression
> 
>
> Key: ARROW-17991
> URL: https://issues.apache.org/jira/browse/ARROW-17991
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joost Hoozemans
>Assignee: Joost Hoozemans
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When trying to write an IPC dataset using pyarrow.dataset, it is not possible 
> to pass a compression argument:
> Trying to pass a pyarrow.ipc.IpcWriteOptions object:
> >>> ds.write_dataset(f, "./thing.arrow", format=ds.IpcFileFormat(), 
> >>> file_options=ipc.IpcWriteOptions(compression='lz4'))
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/joost/.cache/pypoetry/virtualenvs/datalogistik-rL_l_suP-py3.8/lib/python3.8/site-packages/pyarrow/dataset.py",
>  line 940, in write_dataset
>     if format != file_options.format:
> AttributeError: 'pyarrow.lib.IpcWriteOptions' object has no attribute 'format'
>  
> Alternatively, pyarrow.dataset.IpcFileFormat().make_write_options() does not 
> support a compression parameter



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18074) [CI] Running ctest for PyArrow C++ not needed anymore

2022-10-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-18074:
---
Fix Version/s: 10.0.0
   (was: 11.0.0)

> [CI] Running ctest for PyArrow C++ not needed anymore
> -
>
> Key: ARROW-18074
> URL: https://issues.apache.org/jira/browse/ARROW-18074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In {{ci/scripts/python_test.sh}} we are still running {{ctest}} to check C++ 
> tests for PyArrow:
> [https://github.com/apache/arrow/blob/b832853ba62171d5fe5077681083fc6ea49bfd44/ci/scripts/python_test.sh#L58-L66]
> After https://issues.apache.org/jira/browse/ARROW-17016 was resolved that is 
> not needed anymore and can be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18074) [CI] Running ctest for PyArrow C++ not needed anymore

2022-10-17 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-18074.

Fix Version/s: 11.0.0
   (was: 10.0.0)
   Resolution: Fixed

Issue resolved by pull request 14435
[https://github.com/apache/arrow/pull/14435]

> [CI] Running ctest for PyArrow C++ not needed anymore
> -
>
> Key: ARROW-18074
> URL: https://issues.apache.org/jira/browse/ARROW-18074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In {{ci/scripts/python_test.sh}} we are still running {{ctest}} to check C++ 
> tests for PyArrow:
> [https://github.com/apache/arrow/blob/b832853ba62171d5fe5077681083fc6ea49bfd44/ci/scripts/python_test.sh#L58-L66]
> After https://issues.apache.org/jira/browse/ARROW-17016 was resolved that is 
> not needed anymore and can be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17849) [R][Docs] Document changes due to C++17 for centos-7 users

2022-10-17 Thread Benson Muite (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618809#comment-17618809
 ] 

Benson Muite commented on ARROW-17849:
--

Verification currently uses devtoolset-11 
https://github.com/apache/arrow/blob/master/dev/release/verify-yum.sh

> [R][Docs] Document changes due to C++17 for centos-7 users
> --
>
> Key: ARROW-17849
> URL: https://issues.apache.org/jira/browse/ARROW-17849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Benson Muite
>Priority: Blocker
> Fix For: 10.0.0
>
>
> With the switch to C++17 centos 7 users need to install and enable devtoolset 
> (and possibly change makevars) to be able to compile the R package, even when 
> using the libarrow binary (see [ARROW-17594]) but that they can use 
> INSTALL_opts = "--build" to get a binary package that is installable on a 
> centos machine WITHOUT dts -> offline build section. Centos 7 RSPM is an 
> alternative source for that binary package. 
> Also add messaging in configure or build_arrow_static.sh so that if someone 
> is trying to install from source with gcc 4.8, we tell them what they need to 
> do.
> This should be documented and noted in the release notes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17983) [Parquet][C++][Python] "List index overflow" when read parquet file

2022-10-17 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618804#comment-17618804
 ] 

Yibo Cai commented on ARROW-17983:
--

cc [~emkornfi...@gmail.com] for comments.

> [Parquet][C++][Python] "List index overflow" when read parquet file
> ---
>
> Key: ARROW-17983
> URL: https://issues.apache.org/jira/browse/ARROW-17983
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Reporter: Yibo Cai
>Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < 
> max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details to reproduce this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Tested with a small dataset, the error might come from below code.
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is 
> incremented) {{m * n}} times, which is beyond {{max(int32)}}.
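The overflow condition can be illustrated with plain integer arithmetic (the row and list sizes below are hypothetical, chosen only so that m * n exceeds max(int32)):

```python
# The failure mode in words: the Parquet reader accumulates list offsets
# in a 32-bit counter, so once the total number of child elements m * n
# passes max(int32), the running offset wraps negative.
INT32_MAX = 2**31 - 1          # 2147483647
n = 20_000_000                 # rows: fits comfortably in int32
m = 120                        # list elements per row
total = n * m                  # 2_400_000_000 child elements overall

# emulate two's-complement wraparound of a running int32 offset
wrapped = (total + 2**31) % 2**32 - 2**31
print(total > INT32_MAX, wrapped)  # True -1894967296
```

A 64-bit offset type (or chunking the reads so no single batch crosses the int32 boundary) would avoid the wraparound.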



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18075) [Website] Update install page for 9.0.0

2022-10-17 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-18075:


 Summary: [Website] Update install page for 9.0.0
 Key: ARROW-18075
 URL: https://issues.apache.org/jira/browse/ARROW-18075
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-16603) [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options

2022-10-17 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16603:

Description: 
Reproducible example:
{code:python}
import json
import pyarrow.json as pj
import pyarrow as pa

s = {"id": "value", "nested": {"value": 1}}

with open("issue.json", "w") as write_file:
json.dump(s, write_file, indent=4)

schema = pa.schema([
pa.field("id", pa.string(), nullable=False),
pa.field("nested", pa.struct([pa.field("value", pa.int64(), 
nullable=False)]))
])

table = pj.read_json('issue.json', 
parse_options=pj.ParseOptions(explicit_schema=schema))

print(schema)
# id: string not null
# nested: struct
#   child 0, value: int64 not null 
print(table.schema)
# id: string
# nested: struct
#   child 0, value: int64{code}

  was:
Reproducible example:
{code:python}
import json
import pyarrow.json as pj
import pyarrow as pa

s = {"id": "value", "nested": {"value": 1}}

with open("issue.json", "w") as write_file:
json.dump(s, write_file, indent=4)

schema = pa.schema([
pa.field("id", pa.string(), nullable=False),
pa.field("nested", pa.struct([pa.field("value", pa.int64(), 
nullable=False)]))
])

table = pj.read_json('issue.json', 
parse_options=pj.ParseOptions(explicit_schema=schema))

print(schema)
print(table.schema)
{code}


> [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema 
> parse_options
> ---
>
> Key: ARROW-16603
> URL: https://issues.apache.org/jira/browse/ARROW-16603
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alenka Frim
>Priority: Major
>
> Reproducible example:
> {code:python}
> import json
> import pyarrow.json as pj
> import pyarrow as pa
> s = {"id": "value", "nested": {"value": 1}}
> with open("issue.json", "w") as write_file:
> json.dump(s, write_file, indent=4)
> schema = pa.schema([
> pa.field("id", pa.string(), nullable=False),
> pa.field("nested", pa.struct([pa.field("value", pa.int64(), 
> nullable=False)]))
> ])
> table = pj.read_json('issue.json', 
> parse_options=pj.ParseOptions(explicit_schema=schema))
> print(schema)
> # id: string not null
> # nested: struct
> #   child 0, value: int64 not null 
> print(table.schema)
> # id: string
> # nested: struct
> #   child 0, value: int64{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)