[jira] [Created] (ARROW-12445) [Rust] Design and implement packaging process to bundle Rust in signed tar

2021-04-18 Thread Jira
Jorge Leitão created ARROW-12445:


 Summary: [Rust] Design and implement packaging process to bundle 
Rust in signed tar
 Key: ARROW-12445
 URL: https://issues.apache.org/jira/browse/ARROW-12445
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Affects Versions: 5.0.0
Reporter: Jorge Leitão
 Fix For: 5.0.0


The goal of this task is to agree on the strategy and process, and to implement it 
in dev/release, to bundle Rust's source code in the signed tar as part of the 
arrow release.

Ideas:
1. use the latest published source code in crates.io
2. use a git ref from arrow-rs pointing to the latest release
3. use the source in arrow-rs@master

Some pros and cons:

1. 
* [pro] it is downloaded from within ASF
* [pro] it has been released (as it was voted on)
* [con] it is not integration-tested against the latest arrow@master, only against 
master at the time of the release
2.
* [pro] it has been officially released (as it was voted on)
* [con] it is downloaded from outside ASF
* [con] it is not integration-tested against the latest arrow@master, only against 
master at the time of the release
3. 
* [pro] it is the latest
* [pro] it is integration-tested against the latest master
* [con] it has not been released on crates.io




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12444) [RUST] [CI] Remove Rust and point integration tests to arrow-rs repo

2021-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12444:
---
Labels: pull-request-available  (was: )

> [RUST] [CI] Remove Rust and point integration tests to arrow-rs repo
> 
>
> Key: ARROW-12444
> URL: https://issues.apache.org/jira/browse/ARROW-12444
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Goals:
> * Make integration tests run against arrow-rs@master
> * Remove Rust from apache/arrow
> Tasks:
> * Remove Rust from CI
> * Remove rust/
> * git clone apache/arrow-rs@master in integration tests
> * remove rust from Archery Lint
> * remove rust from PR labeler
> * remove rust from detect_changes.py





[jira] [Commented] (ARROW-12399) Unable to load libhdfs

2021-04-18 Thread Sukesh Pabolu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324684#comment-17324684
 ] 

Sukesh Pabolu commented on ARROW-12399:
---

I am still waiting for a reply.

> Unable to load libhdfs
> --
>
> Key: ARROW-12399
> URL: https://issues.apache.org/jira/browse/ARROW-12399
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Sukesh Pabolu
>Priority: Major
>  Labels: filesystem, hdfs
> Fix For: 3.0.0
>
> Attachments: image-2021-04-15-20-04-50-069.png
>
>
> I am using pyarrow 3.0.0 with Python 3.7 and Hadoop 2.10.1 on Windows 10 
> 64-bit, and I am facing the following error. 
> I am using pyspark 3.1.1 and am not able to save a dataframe to HDFS. When I 
> used pyspark 3.0.0, I was able to save the dataframe to HDFS.
> *Please help:*
> *import pyarrow as pa*
>  *fs = pa.hdfs.connect(host='localhost', port=9001)*
>  __main__:1: DeprecationWarning: pyarrow.hdfs.connect is deprecated as of 
> 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File 
> "C:\Users\1570513\Anaconda3\envs\on-premise-latest\lib\site-packages\pyarrow\hdfs.py",
>  line 219, in connect
>  extra_conf=extra_conf
>  File 
> "C:\Users\1570513\Anaconda3\envs\on-premise-latest\lib\site-packages\pyarrow\hdfs.py",
>  line 229, in _connect
>  extra_conf=extra_conf)
>  File 
> "C:\Users\1570513\Anaconda3\envs\on-premise-latest\lib\site-packages\pyarrow\hdfs.py",
>  line 45, in __init__
>  self._connect(host, port, user, kerb_ticket, extra_conf)
>  File "pyarrow\io-hdfs.pxi", line 75, in pyarrow.lib.HadoopFileSystem._connect
>  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
>  OSError: Unable to load libhdfs: The specified module could not be found.
>  
>  
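The `DeprecationWarning` in the traceback already points at the replacement API. As a minimal sketch, pyarrow locates `libhdfs` (`hdfs.dll` on Windows) through environment variables; the paths below are hypothetical placeholders, not taken from this report:

```python
import os

# Hypothetical paths - adjust to the local Hadoop/Java installation.
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop-2.10.1")
os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk1.8.0")
# pyarrow checks ARROW_LIBHDFS_DIR for the libhdfs shared library
# before falling back to locations derived from HADOOP_HOME.
os.environ.setdefault(
    "ARROW_LIBHDFS_DIR",
    os.path.join(os.environ["HADOOP_HOME"], "lib", "native"),
)

# With the environment in place, the non-deprecated API would be:
# from pyarrow import fs
# hdfs = fs.HadoopFileSystem(host="localhost", port=9001)
```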





[jira] [Commented] (ARROW-12444) [RUST] [CI] Remove Rust and point integration tests to arrow-rs repo

2021-04-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-12444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324676#comment-17324676
 ] 

Jorge Leitão commented on ARROW-12444:
--

cc [~andygrove] [~wes] [~alamb] [~kszucs]

> [RUST] [CI] Remove Rust and point integration tests to arrow-rs repo
> 
>
> Key: ARROW-12444
> URL: https://issues.apache.org/jira/browse/ARROW-12444
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>
> Goals:
> * Make integration tests run against arrow-rs@master
> * Remove Rust from apache/arrow
> Tasks:
> * Remove Rust from CI
> * Remove rust/
> * git clone apache/arrow-rs@master in integration tests
> * remove rust from Archery Lint
> * remove rust from PR labeler
> * remove rust from detect_changes.py





[jira] [Created] (ARROW-12444) [RUST] [CI] Remove Rust and point integration tests to arrow-rs repo

2021-04-18 Thread Jira
Jorge Leitão created ARROW-12444:


 Summary: [RUST] [CI] Remove Rust and point integration tests to 
arrow-rs repo
 Key: ARROW-12444
 URL: https://issues.apache.org/jira/browse/ARROW-12444
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão


Goals:

* Make integration tests run against arrow-rs@master
* Remove Rust from apache/arrow

Tasks:

* Remove Rust from CI
* Remove rust/
* git clone apache/arrow-rs@master in integration tests
* remove rust from Archery Lint
* remove rust from PR labeler
* remove rust from detect_changes.py







[jira] [Created] (ARROW-12443) [C++][Gandiva] Implement castVARCHAR function for binary input

2021-04-18 Thread Jira
João Pedro Antunes Ferreira created ARROW-12443:
---

 Summary: [C++][Gandiva] Implement castVARCHAR function for binary 
input
 Key: ARROW-12443
 URL: https://issues.apache.org/jira/browse/ARROW-12443
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: João Pedro Antunes Ferreira
Assignee: João Pedro Antunes Ferreira


Implement castVARCHAR function for binary input





[jira] [Assigned] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-12420:
---

Assignee: Krisztian Szucs

> [C++/Dataset] Reading null columns as dictionary no longer possible
> 
>
> Key: ARROW-12420
> URL: https://issues.apache.org/jira/browse/ARROW-12420
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 4.0.0
>Reporter: Uwe Korn
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Reading a dataset with a dictionary column where some of the files don't 
> contain any data for that column (and thus are typed as null) broke with 
> https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release 
> though and thus I would consider this a regression.
> This can be reproduced using the following Python snippet:
> {code}
> import pyarrow as pa
> import pyarrow.fs
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({"a": [None, None]})
> pq.write_table(table, "test.parquet")
> schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
> fsds = ds.FileSystemDataset.from_paths(
> paths=["test.parquet"],
> schema=schema,
> format=pa.dataset.ParquetFileFormat(),
> filesystem=pa.fs.LocalFileSystem(),
> )
> fsds.to_table()
> {code}
> The exception on master is currently:
> {code}
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
>   6 filesystem=pa.fs.LocalFileSystem(),
>   7 )
> > 8 fsds.to_table()
> ~/Development/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Dataset.to_table()
> 456 table : Table instance
> 457 """
> --> 458 return self._scanner(**kwargs).to_table()
> 459 
> 460 def head(self, int num_rows, **kwargs):
> ~/Development/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Scanner.to_table()
>2887 result = self.scanner.ToTable()
>2888 
> -> 2889 return pyarrow_wrap_table(GetResultValue(result))
>2890 
>2891 def take(self, object indices):
> ~/Development/arrow/python/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
> 140 nogil except -1:
> --> 141 return check_status(status)
> 142 
> 143 
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> 116 raise ArrowKeyError(message)
> 117 elif status.IsNotImplemented():
> --> 118 raise ArrowNotImplementedError(message)
> 119 elif status.IsTypeError():
> 120 raise ArrowTypeError(message)
> ArrowNotImplementedError: Unsupported cast from null to 
> dictionary (no available cast 
> function for target type)
> {code}





[jira] [Resolved] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-12420.
-
Resolution: Fixed

Issue resolved by pull request 10093
[https://github.com/apache/arrow/pull/10093]

> [C++/Dataset] Reading null columns as dictionary no longer possible
> 
>
> Key: ARROW-12420
> URL: https://issues.apache.org/jira/browse/ARROW-12420
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 4.0.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Reading a dataset with a dictionary column where some of the files don't 
> contain any data for that column (and thus are typed as null) broke with 
> https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release 
> though and thus I would consider this a regression.
> This can be reproduced using the following Python snippet:
> {code}
> import pyarrow as pa
> import pyarrow.fs
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({"a": [None, None]})
> pq.write_table(table, "test.parquet")
> schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
> fsds = ds.FileSystemDataset.from_paths(
> paths=["test.parquet"],
> schema=schema,
> format=pa.dataset.ParquetFileFormat(),
> filesystem=pa.fs.LocalFileSystem(),
> )
> fsds.to_table()
> {code}
> The exception on master is currently:
> {code}
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
>   6 filesystem=pa.fs.LocalFileSystem(),
>   7 )
> > 8 fsds.to_table()
> ~/Development/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Dataset.to_table()
> 456 table : Table instance
> 457 """
> --> 458 return self._scanner(**kwargs).to_table()
> 459 
> 460 def head(self, int num_rows, **kwargs):
> ~/Development/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Scanner.to_table()
>2887 result = self.scanner.ToTable()
>2888 
> -> 2889 return pyarrow_wrap_table(GetResultValue(result))
>2890 
>2891 def take(self, object indices):
> ~/Development/arrow/python/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
> 140 nogil except -1:
> --> 141 return check_status(status)
> 142 
> 143 
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> 116 raise ArrowKeyError(message)
> 117 elif status.IsNotImplemented():
> --> 118 raise ArrowNotImplementedError(message)
> 119 elif status.IsTypeError():
> 120 raise ArrowTypeError(message)
> ArrowNotImplementedError: Unsupported cast from null to 
> dictionary (no available cast 
> function for target type)
> {code}





[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11400:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python] Pickled ParquetFileFragment has invalid partition_expresion with 
> dictionary type in pyarrow 2.0
> 
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet", 
> partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
> "test_partitioned_parquet/", format="parquet", 
> partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization 
> with pickle, it gives errors (and with the original data example I actually 
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: 
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]: 
> pyarrow.Table
> col: int64
> part: dictionary
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with 
> dictionary type. 
> Also, when using an integer column as the partition column, you get wrong 
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]: 
>1,
>   2
> ][0]:dictionary)>
> In [43]: frag2.partition_expression
> Out[43]: 
>170145232,
>   32754
> ][0]:dictionary)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember whether it 
> was fixed intentionally ([~bkietz]?), it would be good to add some tests for it.





[jira] [Updated] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12437:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Rust] [Ballista] Ballista plans must not include RepartitionExec
> -
>
> Key: ARROW-12437
> URL: https://issues.apache.org/jira/browse/ARROW-12437
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Ballista plans must not include RepartitionExec because it produces 
> incorrect results. Ballista needs to manage its own repartitioning in a 
> distributed-aware way later on. For now we just need to configure the 
> DataFusion context to disable repartition.





[jira] [Updated] (ARROW-12419) [Java] flatc is not used in mvn

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12419:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Java] flatc is not used in mvn
> ---
>
> Key: ARROW-12419
> URL: https://issues.apache.org/jira/browse/ARROW-12419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 4.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ARROW-12111 removed the usage of flatc during the build process in mvn. Thus, 
> it is not necessary to explicitly download flatc for s390x.





[jira] [Updated] (ARROW-12424) [Go][Parquet] Add Schema Package

2021-04-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-12424:
-
Summary: [Go][Parquet] Add Schema Package  (was: Add Schema Package)

> [Go][Parquet] Add Schema Package
> 
>
> Key: ARROW-12424
> URL: https://issues.apache.org/jira/browse/ARROW-12424
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go, Parquet
>Reporter: Matt Topol
>Assignee: Matt Topol
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Adding the ported code for the Schema module for Go Parquet library.





[jira] [Commented] (ARROW-12423) [Docs] Codecov badge in main Readme only applies to Rust

2021-04-18 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324611#comment-17324611
 ] 

Kouhei Sutou commented on ARROW-12423:
--

We can remove it because the Rust code will move to 
https://github.com/apache/arrow-rs and 
https://github.com/apache/arrow-datafusion .

> [Docs] Codecov badge in main Readme only applies to Rust
> 
>
> Key: ARROW-12423
> URL: https://issues.apache.org/jira/browse/ARROW-12423
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Dominik Moritz
>Priority: Major
>
> The badge in https://github.com/apache/arrow/blob/master/README.md links to 
> https://app.codecov.io/gh/apache/arrow, which seems to only show the coverage 
> for the Rust code. 





[jira] [Updated] (ARROW-12423) [Docs] Codecov badge in main Readme only applies to Rust

2021-04-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-12423:
-
Component/s: Documentation

> [Docs] Codecov badge in main Readme only applies to Rust
> 
>
> Key: ARROW-12423
> URL: https://issues.apache.org/jira/browse/ARROW-12423
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Dominik Moritz
>Priority: Major
>
> The badge in https://github.com/apache/arrow/blob/master/README.md links to 
> https://app.codecov.io/gh/apache/arrow, which seems to only show the coverage 
> for the Rust code. 





[jira] [Updated] (ARROW-12423) [Docs] Codecov badge in main Readme only applies to Rust

2021-04-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-12423:
-
Summary: [Docs] Codecov badge in main Readme only applies to Rust  (was: 
Codecov badge in main Readme only applies to Rust)

> [Docs] Codecov badge in main Readme only applies to Rust
> 
>
> Key: ARROW-12423
> URL: https://issues.apache.org/jira/browse/ARROW-12423
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Dominik Moritz
>Priority: Major
>
> The badge in https://github.com/apache/arrow/blob/master/README.md links to 
> https://app.codecov.io/gh/apache/arrow, which seems to only show the coverage 
> for the Rust code. 





[jira] [Updated] (ARROW-12442) [CI] Set job timeouts on GitHub Actions

2021-04-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-12442:
-
Summary: [CI] Set job timeouts on GitHub Actions  (was: [CI] Set job 
timeouts on Github Actions)

> [CI] Set job timeouts on GitHub Actions
> ---
>
> Key: ARROW-12442
> URL: https://issues.apache.org/jira/browse/ARROW-12442
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>
> The default timeout for a single job in GitHub Actions is 6 hours:
> [https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes]
> While our jobs normally do not exceed 1 hour of runtime (and most of them are 
> far quicker), sometimes some network issues may lead a job to take up the 
> full 6 hours before timing out. Not only does this contribute to our own 
> build queue growing unnecessarily, but it also impedes other Apache projects, 
> since the number of jobs which can be run in parallel is capped at the 
> organization level.
> We should therefore configure job timeouts which reflect our expectation of 
> the overall runtime for each job. 1 hour should be a safe value for most of 
> them, and would already dramatically reduce the impact of network issues.
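Concretely, the limit is configured per job with the `timeout-minutes` key documented at the link above; a minimal sketch (job name and steps are illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    # Fail after 1 hour instead of the 6-hour default.
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v2
```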





[jira] [Assigned] (ARROW-11524) [Rust][DataFusion] TPC-H Query 9

2021-04-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniël Heres reassigned ARROW-11524:


Assignee: Daniël Heres

> [Rust][DataFusion] TPC-H Query 9
> 
>
> Key: ARROW-11524
> URL: https://issues.apache.org/jira/browse/ARROW-11524
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>
> Fails with the error "Cartesian joins are not supported". It seems that 
> DataFusion cannot convert it to a normal join (+ filter).
> {{Error: NotImplemented("Cartesian joins are not supported")}}





[jira] [Commented] (ARROW-11897) [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays

2021-04-18 Thread Yordan Pavlov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324597#comment-17324597
 ] 

Yordan Pavlov commented on ARROW-11897:
---

UPDATE: over the past few days I managed to finish the core implementation of 
the new ArrowArrayReader with the key bits being:
 * the converters will only produce an all-value / no-null ArrayData instance - 
this simplifies the converter interface and keeps all other logic generic
 * if no def levels are available, this no-null ArrayData produced from the 
converter is simply converted to an array and returned without changes
 * if def levels are available, a BooleanArray is created from the def levels 
and used to efficiently determine how many values to read and also efficiently 
insert NULLs using MutableArrayData (with an algorithm very similar to zip()) - 
this implementation re-uses as much of the existing arrow code as possible
 * the StringArray converter has been implemented as a function before moving 
to a converter in a later change

Next steps are:
 * implement decoder iterator for def / rep levels
 * implement decoder iterator for plain encoding
 * make unit tests pass
 * attempt to replace ComplexObjectArrayReader for StringArrays
 * benchmark performance
 * create initial PR

the latest changes can be found here:

https://github.com/yordan-pavlov/arrow/commit/7299f2a747cc52237c21b9d85df994a66097d731#diff-dce1a37fc60ea0c8d13a61bf530abbf9f82aef43224597f31a7ba4d9fe7bd10dR418
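The zip()-like NULL-insertion step described in the first list above can be sketched in illustrative Python (this is a stand-in for the actual Rust `MutableArrayData` code; the function name and `max_def_level` parameter are invented for the example):

```python
def insert_nulls(values, def_levels, max_def_level=1):
    """Expand a dense, no-null value sequence into a nullable
    sequence using Parquet definition levels."""
    # Boolean validity mask, analogous to the BooleanArray built
    # from the def levels in the description above.
    validity = [level == max_def_level for level in def_levels]
    values_iter = iter(values)
    # Zip-like merge: take the next value where valid, else emit a NULL.
    return [next(values_iter) if valid else None for valid in validity]

result = insert_nulls(["a", "b"], [1, 0, 1, 0])
# -> ['a', None, 'b', None]
```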

> [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays
> --
>
> Key: ARROW-11897
> URL: https://issues.apache.org/jira/browse/ARROW-11897
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> The overall goal is to create an efficient pipeline from Parquet page data 
> into Arrow arrays, with as little intermediate conversion and memory 
> allocation as possible. It is assumed, that for best performance, we favor 
> doing fewer but larger copy operations (rather than more but smaller). 
> Such a pipeline would need to be flexible in order to enable high performance 
> implementations in several different cases:
>  (1) In some cases, such as plain-encoded number array, it might even be 
> possible to copy / create the array from a single contiguous section from a 
> page buffer. 
>  (2) In other cases, such as a plain-encoded string array, since values are 
> encoded in non-contiguous slices (where value bytes are separated by length 
> bytes) and a page buffer contains multiple values, individual values will have 
> to be copied separately and it's not obvious how this can be avoided.
>  (3) Finally, in the case of bit-packing encoding and smaller numeric values, 
> page buffer data has to be decoded / expanded before it is ready to copy into 
> an arrow array, so a `Vec` will have to be returned instead of a slice 
> pointing to a page buffer.
> I propose that the implementation is split into three layers - (1) decoder, 
> (2) column reader and (3) array converter layers (not too dissimilar from the 
> current implementation, except it would be based on Iterators), as follows:
> *(1) Decoder layer:*
> A decoder output abstraction that enables all of the above cases and 
> minimizes intermediate memory allocation is `Iterator<Item = (usize, AsRef<[u8]>)>`.
>  Then in case (1) above, where a numeric array could be created from a single 
> contiguous byte slice, such an iterator could return a single item such as 
> `(1024, &[u8])`. 
>  In case (2) above, where each string value is encoded as an individual byte 
> slice, but it is still possible to copy directly from a page buffer, a 
> decoder iterator could return a sequence of items such as `(1, &[u8])`. 
>  And finally in case (3) above, where bit-packed values have to be 
> unpacked/expanded, and it's NOT possible to copy value bytes directly from a 
> page buffer, a decoder iterator could return items representing chunks of 
> values such as `(32, Vec)` where bit-packed values have been unpacked and 
>  the chunk size is configured for best performance.
> Another benefit of an `Iterator`-based abstraction is that it would prepare 
> the parquet crate for  migration to `async` `Stream`s (my understanding is 
> that a `Stream` is effectively an async `Iterator`).
> *(2) Column reader layer:*
> Then a higher level iterator could combine a value iterator and a (def) level 
> iterator to produce a sequence of `ValueSequence(count, AsRef<[u8]>)` and 
> `NullSequence(count)` items from which an arrow array can be created 
> efficiently.
> In future, a higher level iterator (for the keys) could be combined with a 
> dictionary value iterator to create a dictionary array.
> *(3) Array converter layer:*
> 
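The decoder-layer contract from the quoted proposal can be illustrated with a Python generator standing in for the proposed Rust iterator of `(value_count, byte_slice)` items (the function name, chunking, and fixed 4-byte width are assumptions made for the example, not part of the proposal):

```python
def plain_i32_decoder(page_buffer, num_values, chunk_size=1024):
    """Sketch of case (1) above: plain-encoded 32-bit values can be
    handed out as large contiguous slices of the page buffer,
    yielding (value_count, byte_slice) items."""
    width = 4  # assumed fixed-width plain encoding
    for start in range(0, num_values, chunk_size):
        count = min(chunk_size, num_values - start)
        yield count, page_buffer[start * width:(start + count) * width]

page = b"\x00" * 32  # a fake page holding 8 int32 values
items = list(plain_i32_decoder(page, 8))
# a single item covering the whole page, as in the `(1024, &[u8])` case
```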

[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-04-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11400:
--
Fix Version/s: (was: 4.0.0)
   3.0.0

> [Python] Pickled ParquetFileFragment has invalid partition_expresion with 
> dictionary type in pyarrow 2.0
> 
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet", 
> partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
> "test_partitioned_parquet/", format="parquet", 
> partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization 
> with pickle, it gives errors (and with the original data example I actually 
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: 
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]: 
> pyarrow.Table
> col: int64
> part: dictionary
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with 
> dictionary type. 
> Also, when using an integer column as the partition column, you get wrong 
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]: 
>1,
>   2
> ][0]:dictionary)>
> In [43]: frag2.partition_expression
> Out[43]: 
>170145232,
>   32754
> ][0]:dictionary)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember whether it 
> was fixed intentionally ([~bkietz]?), it would be good to add some tests for it.





[jira] [Resolved] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12434.

Resolution: Fixed

PR was merged

> [Rust] [Ballista] Show executed plans with metrics
> --
>
> Key: ARROW-12434
> URL: https://issues.apache.org/jira/browse/ARROW-12434
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Show executed plans with metrics to help with debugging and performance tuning





[jira] [Closed] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-12261.
--
Resolution: Fixed

Moved to https://github.com/apache/arrow-datafusion/issues/2

> [Rust] [Ballista] Ballista should not have its own DataFrame API
> 
>
> Key: ARROW-12261
> URL: https://issues.apache.org/jira/browse/ARROW-12261
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
> Fix For: 5.0.0
>
>
> When building the Ballista POC it was necessary to implement a new DataFrame 
> API that wrapped the DataFusion API.
> One issue is that it wasn't possible to override the behavior of the collect 
> method to make it use the Ballista context rather than the DataFusion context.
> Now that the projects are in the same repo it should be easier to fix this 
> and have users always use the DataFusion DataFrame API.





[jira] [Updated] (ARROW-12441) [Rust][DataFusion] Support cartesian join

2021-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12441:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Support cartesian join
> -
>
> Key: ARROW-12441
> URL: https://issues.apache.org/jira/browse/ARROW-12441
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-12441) [Rust][DataFusion] Support cartesian join

2021-04-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniël Heres updated ARROW-12441:
-
Issue Type: New Feature  (was: Bug)

> [Rust][DataFusion] Support cartesian join
> -
>
> Key: ARROW-12441
> URL: https://issues.apache.org/jira/browse/ARROW-12441
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>






[jira] [Commented] (ARROW-8621) [Go] Add Module support by creating tags

2021-04-18 Thread Matt Topol (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324577#comment-17324577
 ] 

Matt Topol commented on ARROW-8621:
---

[~kszucs] This is the JIRA issue I was referring to on the arrow-dev mailing 
list. If I knew more about how the tags are created in the release scripts I'd 
offer to make the change myself. This would go a long way toward keeping the 
versioning consistent across the different language modules.

Also, with the addition of the Go parquet library, this would involve both of 
these tags:

`go/arrow/v4.0.0` / `go/arrow/v4.0.0-rc0` for release candidates
`go/parquet/v4.0.0`

Thanks

> [Go] Add Module support by creating tags
> 
>
> Key: ARROW-8621
> URL: https://issues.apache.org/jira/browse/ARROW-8621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Kyle Brandt
>Priority: Minor
>
> Arrow has a go.mod, but the go modules system expects a certain git tag for 
> Go modules to work.
> Based on 
> [https://github.com/golang/go/wiki/Modules#faqs--multi-module-repositories] I 
> believe the tag would be 
> {code}
> go/arrow/v0.17.0
> {code}
> Currently:
> {code}
> $ go get github.com/apache/arrow/go/arrow@v0.17.0 
> go get github.com/apache/arrow/go/arrow@v0.17.0: 
> github.com/apache/arrow/go/arrow@v0.17.0: invalid version: unknown revision 
> go/arrow/v0.17.0
> {code}
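The tag scheme discussed above can be sketched with a small helper (hypothetical; `go_module_tag` is illustrative only and not part of any Arrow release tooling):

```python
# Hypothetical helper mirroring the multi-module tag scheme described above:
# a Go module living in a subdirectory is tagged "<subdir>/v<semver>", with
# an optional "-rcN" suffix for release candidates.
def go_module_tag(module_dir: str, version: str, rc=None) -> str:
    suffix = f"-rc{rc}" if rc is not None else ""
    return f"{module_dir}/v{version}{suffix}"

print(go_module_tag("go/arrow", "4.0.0", rc=0))  # go/arrow/v4.0.0-rc0
print(go_module_tag("go/parquet", "4.0.0"))      # go/parquet/v4.0.0
```

The same scheme yields the `go/arrow/v0.17.0` tag the original report expected.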





[jira] [Updated] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible

2021-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12420:
---
Labels: pull-request-available  (was: )

> [C++/Dataset] Reading null columns as dictionary no longer possible
> 
>
> Key: ARROW-12420
> URL: https://issues.apache.org/jira/browse/ARROW-12420
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 4.0.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reading a dataset with a dictionary column where some of the files don't 
> contain any data for that column (and thus are typed as null) broke with 
> https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release 
> though and thus I would consider this a regression.
> This can be reproduced using the following Python snippet:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({"a": [None, None]})
> pq.write_table(table, "test.parquet")
> schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
> fsds = ds.FileSystemDataset.from_paths(
> paths=["test.parquet"],
> schema=schema,
> format=pa.dataset.ParquetFileFormat(),
> filesystem=pa.fs.LocalFileSystem(),
> )
> fsds.to_table()
> {code}
> The exception on master is currently:
> {code}
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
>   6 filesystem=pa.fs.LocalFileSystem(),
>   7 )
> > 8 fsds.to_table()
> ~/Development/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Dataset.to_table()
> 456 table : Table instance
> 457 """
> --> 458 return self._scanner(**kwargs).to_table()
> 459 
> 460 def head(self, int num_rows, **kwargs):
> ~/Development/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Scanner.to_table()
>2887 result = self.scanner.ToTable()
>2888 
> -> 2889 return pyarrow_wrap_table(GetResultValue(result))
>2890 
>2891 def take(self, object indices):
> ~/Development/arrow/python/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
> 140 nogil except -1:
> --> 141 return check_status(status)
> 142 
> 143 
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> 116 raise ArrowKeyError(message)
> 117 elif status.IsNotImplemented():
> --> 118 raise ArrowNotImplementedError(message)
> 119 elif status.IsTypeError():
> 120 raise ArrowTypeError(message)
> ArrowNotImplementedError: Unsupported cast from null to 
> dictionary (no available cast 
> function for target type)
> {code}





[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12429:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [C++] MergedGeneratorTestFixture is incorrectly instantiated
> 
>
> Key: ARROW-12429
> URL: https://issues.apache.org/jira/browse/ARROW-12429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt]
> Looks like the base class was accidentally instantiated instead of the actual 
> test.





[jira] [Updated] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12432:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Rust] [DataFusion] Add metrics for SortExec
> 
>
> Key: ARROW-12432
> URL: https://issues.apache.org/jira/browse/ARROW-12432
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Add metrics for SortExec





[jira] [Updated] (ARROW-11999) [Java] Support parallel vector element search with user-specified comparator

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11999:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Java] Support parallel vector element search with user-specified comparator
> 
>
> Key: ARROW-11999
> URL: https://issues.apache.org/jira/browse/ARROW-11999
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is in response to the discussion in 
> https://github.com/apache/arrow/pull/5631#discussion_r339110228
> Currently, we only support parallel search with {{RangeEqualsVisitor}}, which 
> does not support user-specified comparators.
> We want to provide the functionality in this issue to support a wider range 
> of use cases.





[jira] [Updated] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12334:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Rust] [Ballista] Aggregate queries producing incorrect results
> ---
>
> Key: ARROW-12334
> URL: https://issues.apache.org/jira/browse/ARROW-12334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I just ran benchmarks for the first time in a while and I see duplicate 
> entries for group by keys.
>  
> For example, query 1 has "group by l_returnflag, l_linestatus" and I see 
> multiple results with l_returnflag = 'A' and l_linestatus = 'F'.
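One way to reproduce this symptom in miniature (an assumption about the mechanism, not a diagnosis of the Ballista code): concatenating per-partition partial aggregates without a final merge leaves one row per partition per key instead of one row per key.

```python
from collections import defaultdict

# Rows of (l_returnflag, l_linestatus, quantity), split across two partitions.
partitions = [
    [("A", "F", 10), ("N", "O", 5)],
    [("A", "F", 7)],
]

# Step 1: partial aggregation within each partition.
partials = []
for part in partitions:
    acc = defaultdict(int)
    for flag, status, qty in part:
        acc[(flag, status)] += qty
    partials.extend(acc.items())

# Without a final merge, ("A", "F") appears once per partition:
keys = [k for k, _ in partials]
assert keys.count(("A", "F")) == 2

# Step 2: the final merge restores one row per group key.
final = defaultdict(int)
for key, qty in partials:
    final[key] += qty
assert final[("A", "F")] == 17
```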





[jira] [Updated] (ARROW-12104) Next Chunk of ported Code

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12104:

Fix Version/s: (was: 5.0.0)
   4.0.0

> Next Chunk of ported Code
> -
>
> Key: ARROW-12104
> URL: https://issues.apache.org/jira/browse/ARROW-12104
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matt Topol
>Assignee: Matt Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Second chunk of ported code contains the Thrift Generated code, and the 
> frameworks for the Encryption, Compression and Properties.





[jira] [Updated] (ARROW-12436) [Rust][Ballista] Add watch capabilities to config backend trait

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12436:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Rust][Ballista] Add watch capabilities to config backend trait
> ---
>
> Key: ARROW-12436
> URL: https://issues.apache.org/jira/browse/ARROW-12436
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Ximo Guanter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [arrow/lib.rs at 66aa3e7c365a8d4c4eca6e23668f2988e714b493 · apache/arrow 
> (github.com)|https://github.com/apache/arrow/blob/66aa3e7c365a8d4c4eca6e23668f2988e714b493/rust/ballista/rust/scheduler/src/lib.rs#L183]





[jira] [Updated] (ARROW-12111) [Java] place files generated by flatc under source control

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12111:

Fix Version/s: (was: 5.0.0)
   4.0.0

> [Java] place files generated by flatc under source control
> --
>
> Key: ARROW-12111
> URL: https://issues.apache.org/jira/browse/ARROW-12111
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Bob Tinsman
>Assignee: Bob Tinsman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The Arrow binary format is implemented with Flatbuffers specification files 
> (_*.fbs_ in the top-level _format_ directory). The _flatc_ binary is used to 
> generate source files for various implementation languages.
> The Java build does the generation as part of every build. However, these 
> languages have _flatc-_generated files under source control:
>  * C++
>  * Rust
>  * Javascript
>  * C#
> Java can do this as well, removing the build dependency on _flatc_ (currently 
> provided by an unofficial Maven artifact, not available under Windows). The 
> Java build doc can be updated to reflect this change and document how to 
> generate and check in files when the binary format changes.





[jira] [Assigned] (ARROW-12440) [Release] Various packaging, release script and release verification script fixes

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-12440:
---

Assignee: Krisztian Szucs

> [Release] Various packaging, release script and release verification script 
> fixes
> -
>
> Key: ARROW-12440
> URL: https://issues.apache.org/jira/browse/ARROW-12440
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fixes for issues surfaced during the preparation of 4.0.0-RC0





[jira] [Resolved] (ARROW-12440) [Release] Various packaging, release script and release verification script fixes

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-12440.
-
Resolution: Fixed

Issue resolved by pull request 10091
[https://github.com/apache/arrow/pull/10091]

> [Release] Various packaging, release script and release verification script 
> fixes
> -
>
> Key: ARROW-12440
> URL: https://issues.apache.org/jira/browse/ARROW-12440
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fixes for issues surfaced during the preparation of 4.0.0-RC0





[jira] [Updated] (ARROW-11317) [Rust] Test the prettyprint feature in CI

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11317:

Fix Version/s: 4.0.0

> [Rust] Test the prettyprint feature in CI
> -
>
> Key: ARROW-11317
> URL: https://issues.apache.org/jira/browse/ARROW-11317
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-11320) [C++] Spurious test failure when creating temporary dir

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11320:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Spurious test failure when creating temporary dir
> ---
>
> Key: ARROW-11320
> URL: https://issues.apache.org/jira/browse/ARROW-11320
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When running the release verification script, I sometimes get this error:
> {code}
> [--] 5 tests from TestInt8/TestSparseTensorRoundTrip/0, where 
> TypeParam = arrow::Int8Type
> [ RUN  ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCOOIndexRowMajor
> /tmp/arrow-3.0.0.4SRpe/apache-arrow-3.0.0/cpp/src/arrow/ipc/tensor_test.cc:53:
>  Failure
> Failed
> '_error_or_value8.status()' failed with IOError: Path already exists: 
> '/tmp/ipc-test-qj6ng827/'
> [  FAILED  ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCOOIndexRowMajor, 
> where TypeParam = arrow::Int8Type (0 ms)
> [ RUN  ] 
> TestInt8/TestSparseTensorRoundTrip/0.WithSparseCOOIndexColumnMajor
> [   OK ] 
> TestInt8/TestSparseTensorRoundTrip/0.WithSparseCOOIndexColumnMajor (0 ms)
> [ RUN  ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCSRIndex
> [   OK ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCSRIndex (0 ms)
> [ RUN  ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCSCIndex
> [   OK ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCSCIndex (0 ms)
> [ RUN  ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCSFIndex
> [   OK ] TestInt8/TestSparseTensorRoundTrip/0.WithSparseCSFIndex (1 ms)
> [--] 5 tests from TestInt8/TestSparseTensorRoundTrip/0 (1 ms total)
> {code}
> It seems that in some edge cases, the random generation of temporary 
> directory names produces duplicates. Most likely this means the random 
> generator is being seeded identically in different processes.
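The suspected mechanism is easy to demonstrate (a Python stand-in for the C++ generator; `temp_name` is invented for this sketch):

```python
import os
import random
import string

def temp_name(rng: random.Random) -> str:
    # Mimic a randomly suffixed temp-dir name like 'ipc-test-qj6ng827'.
    return "ipc-test-" + "".join(
        rng.choices(string.ascii_lowercase + string.digits, k=8))

# Two "processes" seeded identically produce the same "random" name:
assert temp_name(random.Random(42)) == temp_name(random.Random(42))

# Seeding each generator from independent OS entropy avoids the collision:
a = temp_name(random.Random(os.urandom(16)))
b = temp_name(random.Random(os.urandom(16)))
assert a != b
```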





[jira] [Updated] (ARROW-10370) [Python] Spurious s3fs-related test failures

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10370:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python] Spurious s3fs-related test failures
> 
>
> Key: ARROW-10370
> URL: https://issues.apache.org/jira/browse/ARROW-10370
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> I frequently get this error when running the Python test suite:
> {code}
> _ 
> test_write_to_dataset_pathlib_nonlocal[False] 
> _
> Traceback (most recent call last):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/s3fs/core.py",
>  line 984, in _initiate_upload
> Bucket=self.bucket, Key=self.key, ACL=self.acl)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/s3fs/core.py",
>  line 971, in _call_s3
> **kwargs)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/s3fs/core.py",
>  line 189, in _call_s3
> return method(**additional_kwargs)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/botocore/client.py",
>  line 357, in _api_call
> return self._make_api_call(operation_name, kwargs)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/botocore/client.py",
>  line 661, in _make_api_call
> raise error_class(parsed_response, operation_name)
> botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when 
> calling the CreateMultipartUpload operation: The specified bucket does not 
> exist
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/home/antoine/arrow/dev/python/pyarrow/tests/test_parquet.py", line 
> 2978, in test_write_to_dataset_pathlib_nonlocal
> tempdir / "test1", use_legacy_dataset, filesystem=fs)
>   File "/home/antoine/arrow/dev/python/pyarrow/tests/test_parquet.py", line 
> 2853, in _test_write_to_dataset_with_partitions
> pq.write_metadata(output_table.schema, f)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1457, in __exit__
> self.close()
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1425, in close
> self.flush(force=True)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1297, in flush
> self._initiate_upload()
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/s3fs/core.py",
>  line 986, in _initiate_upload
> raise translate_boto_error(e)
> FileNotFoundError: The specified bucket does not exist
> - 
> Captured stderr call 
> --
> Exception ignored in: Exception ignored in: Exception ignored in:  AbstractBufferedFile.__del__ at 0x7f1b119097a0>
> Traceback (most recent call last):
> 
> Exception ignored in:   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1446, in __del__
> 
> Traceback (most recent call last):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1446, in __del__
> 
> Exception ignored in: Traceback (most recent call last):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1446, in __del__
> Traceback (most recent call last):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1446, in __del__
> Exception 
> ignored in: 
> self.close()
> self.close()
> Traceback (most recent call last):
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1446, in __del__
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1425, in close
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1425, in close
> self.flush(force=True)
> self.flush(force=True)
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1297, in flush
>   File 
> "/home/antoine/miniconda3/envs/pyarrow/lib/python3.7/site-packages/fsspec/spec.py",
>  line 1297, in flush
> self.close()
>   

[jira] [Updated] (ARROW-11299) [Python] build warning in python

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11299:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python] build warning in python
> 
>
> Key: ARROW-11299
> URL: https://issues.apache.org/jira/browse/ARROW-11299
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Many warnings about compute kernel options appear when building Arrow Python.
> Removing the line below suppresses the warnings.
> https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
> I think the reason is that the virtual destructor makes the structure 
> non-C-compatible, so the offsetof macro cannot be used safely. As the 
> function options are straightforward, it looks like the destructor is not 
> necessary.
> [~bkietz]
> *Steps to reproduce*
> build arrow cpp
> {code:bash}
>  ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
> -DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
> -DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
> -DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
> -DCMAKE_C_COMPILER=/usr/bin/clang-9 ..
> ~/arrow/cpp/release $ ninja install
> {code}
> build arrow python
> {code:bash}
>  ~/arrow/python $ python --version
>  Python 3.6.9
> ~/arrow/python $ python setup.py build_ext --inplace
>  ..
>  [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]
>  In file included from 
> /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23:0, 
>  from /usr/include/signal.h:303,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h:84,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:5,
>  from 
> /home/cyb/arrow/cpp/release/_install/include/arrow/python/numpy_interop.h:41,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/helpers.h:27,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/api.h:24,
>  from /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:696:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp: In function 
> ‘int __Pyx_modinit_type_init_code()’:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26066:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__FilterOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__FilterOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__FilterOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26082:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__TakeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__TakeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__TakeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26130:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__MinMaxOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__MinMaxOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__MinMaxOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26146:148: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CountOptions’ is undefined [-Winvalid-offsetof]
>  _type_7pyarrow_8_compute__CountOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CountOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^ 
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26162:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__ModeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__ModeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__ModeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26210:154: 

[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expression with dictionary type in pyarrow 2.0

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11400:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python] Pickled ParquetFileFragment has invalid partition_expression with 
> dictionary type in pyarrow 2.0
> 
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet", 
> partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
> "test_partitioned_parquet/", format="parquet", 
> partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization 
> with pickle, it gives errors (and with the original data example I actually 
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: 
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]: 
> pyarrow.Table
> col: int64
> part: dictionary
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with 
> dictionary type. 
> Also when using an integer columns as the partition column, you get wrong 
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]: 
>1,
>   2
> ][0]:dictionary)>
> In [43]: frag2.partition_expression
> Out[43]: 
>170145232,
>   32754
> ][0]:dictionary)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember it being 
> fixed intentionally ([~bkietz]?), it would be good to add some tests for it.





[jira] [Updated] (ARROW-11268) [Rust][DataFusion] Support specifying repartitions in MemTable

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11268:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Rust][DataFusion] Support specifying repartitions in MemTable 
> ---
>
> Key: ARROW-11268
> URL: https://issues.apache.org/jira/browse/ARROW-11268
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-11277) [C++] Fix compilation error in dataset expressions on macOS 10.11

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11277:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Fix compilation error in dataset expressions on macOS 10.11
> -
>
> Key: ARROW-11277
> URL: https://issues.apache.org/jira/browse/ARROW-11277
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See https://github.com/autobrew/homebrew-core/pull/61#issuecomment-761605455
> R binary packages for macOS are built with an old SDK, so this is needed. 





[jira] [Updated] (ARROW-11305) [Rust]: parquet-rowcount binary tries to open itself as a parquet file

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11305:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Rust]: parquet-rowcount binary tries to open itself as a parquet file
> --
>
> Key: ARROW-11305
> URL: https://issues.apache.org/jira/browse/ARROW-11305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Introduced accidentally during clippy warning cleanups in 
> https://github.com/apache/arrow/pull/8687/files#diff-f3f978052bd519af87898fa196715ddb445c327045c09ed07be600ca4e1703b6R60





[jira] [Updated] (ARROW-11329) [Rust] Do not rebuild the library on every change

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11329:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Rust] Do not rebuild the library on every change
> -
>
> Key: ARROW-11329
> URL: https://issues.apache.org/jira/browse/ARROW-11329
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-11108) [Rust] Improve performance of MutableBuffer

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11108:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Rust] Improve performance of MutableBuffer
> ---
>
> Key: ARROW-11108
> URL: https://issues.apache.org/jira/browse/ARROW-11108
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-11216) [Rust] Improve documentation for StringDictionaryBuilder

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11216:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Rust] Improve documentation for StringDictionaryBuilder
> 
>
> Key: ARROW-11216
> URL: https://issues.apache.org/jira/browse/ARROW-11216
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I find myself trying to remember the exact incantation to create a 
> `StringDictionaryBuilder`, so it should be covered by a doc example.





[jira] [Updated] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9128:
---
Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
> -
>
> Key: ARROW-9128
> URL: https://issues.apache.org/jira/browse/ARROW-9128
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Maarten Breddels
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-10489) [C++] Unable to configure or make with intel compiler

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10489:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Unable to configure or make with intel compiler
> -
>
> Key: ARROW-10489
> URL: https://issues.apache.org/jira/browse/ARROW-10489
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
> Environment: SUSE Linux Enterprise Server 12 SP3 with Intel compiler 
> stack.
>Reporter: Jensen Richardson
>Assignee: Jensen Richardson
>Priority: Major
>  Labels: build, newbie, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I am attempting to compile the Arrow C++ libraries for use in an HPC 
> environment, and as such I need to use the Intel compilers to be compatible 
> with all of my other software packages. However, when I try to compile 
> (having set CC=icc, CXX=icpc, and CFORT=ifort), cmake throws the following 
> error:
> {code:java}
> CMake Error at cmake_modules/SetupCxxFlags.cmake:269 (message):   
>Unknown compiler: 18.0.2.20180210 18.0.2.20180210 
> Call Stack (most recent call first):   
>CMakeLists.txt:437 (include)
> {code}
> The interesting thing to me is that it thinks that 18.0.2.20180210 is the 
> name of the compiler, when earlier it output:
>  
> {code:java}
> -- Building using CMake version: 3.16.1 
> -- The C compiler identification is Intel 18.0.2.20180210
> -- The CXX compiler identification is Intel 18.0.2.20180210
> {code}
>  
> So I don't know why it's taking the 18.0.2.20180210 portion instead of the 
> Intel portion. Either way, it leaves me unable to build the libraries.
> I can provide the whole cmake log/error file if necessary.





[jira] [Updated] (ARROW-7633) [C++][CI] Create fuzz targets for tensors and sparse tensors

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-7633:
---
Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][CI] Create fuzz targets for tensors and sparse tensors
> 
>
> Key: ARROW-7633
> URL: https://issues.apache.org/jira/browse/ARROW-7633
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Kenta Murata
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> These use separate API calls disjoint from RecordBatchFileReader and 
> RecordBatchStreamReader, so probably more natural to expose as separate fuzz 
> targets.





[jira] [Updated] (ARROW-11309) [Release][C#] Use .NET 3.1 for verification

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11309:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Release][C#] Use .NET 3.1 for verification
> ---
>
> Key: ARROW-11309
> URL: https://issues.apache.org/jira/browse/ARROW-11309
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-11303) [Release][C++] Enable mimalloc in the windows verification script

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11303:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Release][C++] Enable mimalloc in the windows verification script
> -
>
> Key: ARROW-11303
> URL: https://issues.apache.org/jira/browse/ARROW-11303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-12442) [CI] Set job timeouts on Github Actions

2021-04-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324544#comment-17324544
 ] 

Antoine Pitrou commented on ARROW-12442:


cc [~kou], [~kszucs]

> [CI] Set job timeouts on Github Actions
> ---
>
> Key: ARROW-12442
> URL: https://issues.apache.org/jira/browse/ARROW-12442
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>
> The default timeout for a single job in Github Actions is 6 hours:
> [https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes]
> While our jobs normally do not exceed 1 hour of runtime (and most of them are 
> far quicker), sometimes some network issues may lead a job to take up the 
> full 6 hours before timing out. Not only does this contribute to our own 
> build queue growing unnecessarily, but it also impedes other Apache projects, 
> since the number of jobs which can be run in parallel is capped at the 
> organization level.
> We should therefore configure job timeouts which reflect our expectation of 
> the overall runtime for each job. 1 hour should be a safe value for most of 
> them, and would already dramatically reduce the impact of network issues.





[jira] [Commented] (ARROW-12442) [CI] Set job timeouts on Github Actions

2021-04-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324543#comment-17324543
 ] 

Antoine Pitrou commented on ARROW-12442:


cc [~elek] for information.

> [CI] Set job timeouts on Github Actions
> ---
>
> Key: ARROW-12442
> URL: https://issues.apache.org/jira/browse/ARROW-12442
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>
> The default timeout for a single job in Github Actions is 6 hours:
> [https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes]
> While our jobs normally do not exceed 1 hour of runtime (and most of them are 
> far quicker), sometimes some network issues may lead a job to take up the 
> full 6 hours before timing out. Not only does this contribute to our own 
> build queue growing unnecessarily, but it also impedes other Apache projects, 
> since the number of jobs which can be run in parallel is capped at the 
> organization level.
> We should therefore configure job timeouts which reflect our expectation of 
> the overall runtime for each job. 1 hour should be a safe value for most of 
> them, and would already dramatically reduce the impact of network issues.





[jira] [Created] (ARROW-12442) [CI] Set job timeouts on Github Actions

2021-04-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-12442:
--

 Summary: [CI] Set job timeouts on Github Actions
 Key: ARROW-12442
 URL: https://issues.apache.org/jira/browse/ARROW-12442
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Antoine Pitrou


The default timeout for a single job in Github Actions is 6 hours:

[https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes]

While our jobs normally do not exceed 1 hour of runtime (and most of them are 
far quicker), sometimes some network issues may lead a job to take up the full 
6 hours before timing out. Not only does this contribute to our own build queue 
growing unnecessarily, but it also impedes other Apache projects, since the 
number of jobs which can be run in parallel is capped at the organization level.

We should therefore configure job timeouts which reflect our expectation of the 
overall runtime for each job. 1 hour should be a safe value for most of them, 
and would already dramatically reduce the impact of network issues.
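The fix is a per-job setting in the workflow files. A minimal sketch of what the change could look like — the job and step names here are illustrative, not taken from the actual Arrow workflows:

```yaml
# .github/workflows/example.yml (illustrative job name)
jobs:
  cpp-build:
    runs-on: ubuntu-latest
    # Fail the job after 60 minutes instead of GitHub's 6-hour default.
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v2
      - name: Build
        run: ci/scripts/build.sh
```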





[jira] [Created] (ARROW-12441) [Rust][DataFusion] Support cartesian join

2021-04-18 Thread Jira
Daniël Heres created ARROW-12441:


 Summary: [Rust][DataFusion] Support cartesian join
 Key: ARROW-12441
 URL: https://issues.apache.org/jira/browse/ARROW-12441
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Updated] (ARROW-12440) [Release] Various packaging, release script and release verification script fixes

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12440:

Component/s: Packaging

> [Release] Various packaging, release script and release verification script 
> fixes
> -
>
> Key: ARROW-12440
> URL: https://issues.apache.org/jira/browse/ARROW-12440
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 4.0.0
>
>
> Fixes for issues surfaced during the preparation of 4.0.0-RC0





[jira] [Updated] (ARROW-12440) [Release] Various packaging, release script and release verification script fixes

2021-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12440:
---
Labels: pull-request-available  (was: )

> [Release] Various packaging, release script and release verification script 
> fixes
> -
>
> Key: ARROW-12440
> URL: https://issues.apache.org/jira/browse/ARROW-12440
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fixes for issues surfaced during the preparation of 4.0.0-RC0





[jira] [Created] (ARROW-12440) [Release] Various packaging, release script and release verification script fixes

2021-04-18 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12440:
---

 Summary: [Release] Various packaging, release script and release 
verification script fixes
 Key: ARROW-12440
 URL: https://issues.apache.org/jira/browse/ARROW-12440
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Krisztian Szucs
 Fix For: 4.0.0


Fixes for issues surfaced during the preparation of 4.0.0-RC0





[jira] [Assigned] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-12432:
--

Assignee: Andy Grove

> [Rust] [DataFusion] Add metrics for SortExec
> 
>
> Key: ARROW-12432
> URL: https://issues.apache.org/jira/browse/ARROW-12432
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Add metrics for SortExec





[jira] [Resolved] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12432.

Resolution: Fixed

Issue resolved by pull request 10078
[https://github.com/apache/arrow/pull/10078]

> [Rust] [DataFusion] Add metrics for SortExec
> 
>
> Key: ARROW-12432
> URL: https://issues.apache.org/jira/browse/ARROW-12432
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Add metrics for SortExec





[jira] [Resolved] (ARROW-12436) [Rust][Ballista] Add watch capabilities to config backend trait

2021-04-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-12436.

Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10085
[https://github.com/apache/arrow/pull/10085]

> [Rust][Ballista] Add watch capabilities to config backend trait
> ---
>
> Key: ARROW-12436
> URL: https://issues.apache.org/jira/browse/ARROW-12436
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust - Ballista
>Reporter: Ximo Guanter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [arrow/lib.rs at 66aa3e7c365a8d4c4eca6e23668f2988e714b493 · apache/arrow 
> (github.com)|https://github.com/apache/arrow/blob/66aa3e7c365a8d4c4eca6e23668f2988e714b493/rust/ballista/rust/scheduler/src/lib.rs#L183]





[jira] [Issue Comment Deleted] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter

2021-04-18 Thread Prakhar Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakhar Pandey updated ARROW-11780:
---
Comment: was deleted

(was: Like Alessandro mentions, call to _pyarrow_unwrap_array_ returns a null 
pointer because passed array is ChunkedArray. However I see 
_pyarrow_unwrap_chunked_array_ method present in the same file, maybe that 
should have been called?

I would be happy to investigate more and raise a PR.

 

 )

> [C++][Python] StructArray.from_arrays() crashes Python interpreter
> --
>
> Key: ARROW-11780
> URL: https://issues.apache.org/jira/browse/ARROW-11780
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 3.0.0
>Reporter: ARF
>Assignee: Weston Pace
>Priority: Major
>
> {{StructArray.from_arrays()}} crashes the Python interpreter without error 
> message:
> {code:none}
> (test_pyarrow) Z:\test_pyarrow>python
> Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: 
> Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>>
> >>> table = pa.Table.from_pydict({
> ... 'foo': pa.array([1, 2, 3]),
> ... 'bar': pa.array([4, 5, 6])
> ... })
> >>>
> >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar'])
> (test_pyarrow) Z:\test_pyarrow>
> {code}





[jira] [Updated] (ARROW-12438) [Rust] [DataFusion] Add support for partition pruning

2021-04-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniël Heres updated ARROW-12438:
-
Description: 
Once we implement

https://issues.apache.org/jira/browse/ARROW-11019

it would be good to add support for a partition pruning optimization based on 
filters / the `WHERE` clause.

  was:
Once we implement

https://issues.apache.org/jira/browse/ARROW-11019

 

would be good to add support for partition pruning optimization.


> [Rust] [DataFusion] Add support for partition pruning
> -
>
> Key: ARROW-12438
> URL: https://issues.apache.org/jira/browse/ARROW-12438
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>
> Once we implement
> https://issues.apache.org/jira/browse/ARROW-11019
> it would be good to add support for a partition pruning optimization based on 
> filters / the `WHERE` clause.





[jira] [Created] (ARROW-12439) [Rust] [DataFusion] Add support for eliminating hash repartition

2021-04-18 Thread Jira
Daniël Heres created ARROW-12439:


 Summary: [Rust] [DataFusion] Add support for eliminating hash 
repartition
 Key: ARROW-12439
 URL: https://issues.apache.org/jira/browse/ARROW-12439
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Daniël Heres


If the intermediate data is already partitioned on a certain expression (key), 
the repartition doesn't have to be added (e.g. in a join), or it should be removed 
by an optimization rule. This avoids an unnecessary repartition (and possibly a 
shuffle in Ballista).





[jira] [Created] (ARROW-12438) [Rust] [DataFusion] Add support for partition pruning

2021-04-18 Thread Jira
Daniël Heres created ARROW-12438:


 Summary: [Rust] [DataFusion] Add support for partition pruning
 Key: ARROW-12438
 URL: https://issues.apache.org/jira/browse/ARROW-12438
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres


Once we implement

https://issues.apache.org/jira/browse/ARROW-11019

 

it would be good to add support for a partition pruning optimization.





[jira] [Commented] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter

2021-04-18 Thread Prakhar Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324480#comment-17324480
 ] 

Prakhar Pandey commented on ARROW-11780:


Like Alessandro mentions, the call to _pyarrow_unwrap_array_ returns a null pointer 
because the passed array is a ChunkedArray. However, I see a 
_pyarrow_unwrap_chunked_array_ method in the same file; maybe that 
should have been called instead?

I would be happy to investigate further and raise a PR.

 

 

> [C++][Python] StructArray.from_arrays() crashes Python interpreter
> --
>
> Key: ARROW-11780
> URL: https://issues.apache.org/jira/browse/ARROW-11780
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 3.0.0
>Reporter: ARF
>Assignee: Weston Pace
>Priority: Major
>
> {{StructArray.from_arrays()}} crashes the Python interpreter without error 
> message:
> {code:none}
> (test_pyarrow) Z:\test_pyarrow>python
> Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: 
> Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>>
> >>> table = pa.Table.from_pydict({
> ... 'foo': pa.array([1, 2, 3]),
> ... 'bar': pa.array([4, 5, 6])
> ... })
> >>>
> >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar'])
> (test_pyarrow) Z:\test_pyarrow>
> {code}





[jira] [Resolved] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays

2021-04-18 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-12425.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 10072
[https://github.com/apache/arrow/pull/10072]

> [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
> 
>
> Key: ARROW-12425
> URL: https://issues.apache.org/jira/browse/ARROW-12425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Raphael Taylor-Davies
>Assignee: Raphael Taylor-Davies
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays

2021-04-18 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-12425:

Component/s: Rust

> [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
> 
>
> Key: ARROW-12425
> URL: https://issues.apache.org/jira/browse/ARROW-12425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Raphael Taylor-Davies
>Assignee: Raphael Taylor-Davies
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-12398) [Rust] Remove double bound checks in iterators

2021-04-18 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-12398.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 10046
[https://github.com/apache/arrow/pull/10046]

> [Rust] Remove double bound checks in iterators
> --
>
> Key: ARROW-12398
> URL: https://issues.apache.org/jira/browse/ARROW-12398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Ritchie
>Assignee: Ritchie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-12398) [Rust] Remove double bound checks in iterators

2021-04-18 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-12398:

Summary: [Rust] Remove double bound checks in iterators  (was: Remove 
double bound checks in iterators)

> [Rust] Remove double bound checks in iterators
> --
>
> Key: ARROW-12398
> URL: https://issues.apache.org/jira/browse/ARROW-12398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Ritchie
>Assignee: Ritchie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics

2021-04-18 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-12433:
---

Assignee: Andy Grove

> [Rust] Builds failing due to new flatbuffer release introducing const generics
> --
>
> Key: ARROW-12433
> URL: https://issues.apache.org/jira/browse/ARROW-12433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 4.0.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I filed [https://github.com/google/flatbuffers/issues/6572] but for now we 
> should pin the dependency to 0.8.3


