[jira] [Created] (ARROW-12356) [Website] Update install page instructions to point to artifactory
Neal Richardson created ARROW-12356: --- Summary: [Website] Update install page instructions to point to artifactory Key: ARROW-12356 URL: https://issues.apache.org/jira/browse/ARROW-12356 Project: Apache Arrow Issue Type: Sub-task Components: Website Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 4.0.0 Looks like packages for old versions have been moved over, even if we can't upload new ones yet. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12355) [C++] Implement efficient async CSV scanning
Weston Pace created ARROW-12355: --- Summary: [C++] Implement efficient async CSV scanning Key: ARROW-12355 URL: https://issues.apache.org/jira/browse/ARROW-12355 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace ARROW-12289 adds an inefficient but simple AsyncScanner implementation that does not rely on asynchronous readers. This task is to implement the asynchronous scan operation properly for CSV.
[jira] [Created] (ARROW-12354) [Packaging][RPM] Use apache.jfrog.io/artifactory/ instead of apache.bintray.com/
Kouhei Sutou created ARROW-12354: Summary: [Packaging][RPM] Use apache.jfrog.io/artifactory/ instead of apache.bintray.com/ Key: ARROW-12354 URL: https://issues.apache.org/jira/browse/ARROW-12354 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-12353) [Packaging][deb] Rename -archive-keyring to -apt-source
Kouhei Sutou created ARROW-12353: Summary: [Packaging][deb] Rename -archive-keyring to -apt-source Key: ARROW-12353 URL: https://issues.apache.org/jira/browse/ARROW-12353 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou lintian recommends that a package that installs files into /etc/apt/sources.list.d/ use the -apt-source suffix. See also: https://lintian.debian.net/tags/package-installs-apt-sources This also changes the repository URL from https://apache.bintray.com/ to https://apache.jfrog.io/artifactory/ .
[jira] [Created] (ARROW-12352) [CI][R][Windows] Remove needless workaround for MSYS2
Kouhei Sutou created ARROW-12352: Summary: [CI][R][Windows] Remove needless workaround for MSYS2 Key: ARROW-12352 URL: https://issues.apache.org/jira/browse/ARROW-12352 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Kouhei Sutou Assignee: Kouhei Sutou repo.msys2.org is alive. sf.net is more fragile than repo.msys2.org. See also ARROW-10202.
[jira] [Created] (ARROW-12351) [CI][Ruby] Use ruby/setup-ruby instead of actions/setup-ruby
Kouhei Sutou created ARROW-12351: Summary: [CI][Ruby] Use ruby/setup-ruby instead of actions/setup-ruby Key: ARROW-12351 URL: https://issues.apache.org/jira/browse/ARROW-12351 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou Because actions/setup-ruby is deprecated: {quote} Please note: This action is deprecated and should no longer be used. The team at GitHub has ceased making and accepting code contributions or maintaining issues tracker. Please, migrate your workflows to the ruby/setup-ruby, which is being actively maintained by the official Ruby organization. {quote} https://github.com/actions/setup-ruby#setup-ruby
[jira] [Created] (ARROW-12349) [MATLAB] add support for converting a MATLAB uint64 array to an arrow::NumericArrays arrow::NumericArray
Sarah Gilmore created ARROW-12349: - Summary: [MATLAB] add support for converting a MATLAB uint64 array to an arrow::NumericArrays arrow::NumericArray Key: ARROW-12349 URL: https://issues.apache.org/jira/browse/ARROW-12349 Project: Apache Arrow Issue Type: Task Components: MATLAB Reporter: Sarah Gilmore Create a C++ function that accepts a MATLAB uint64 array and converts it into an arrow::NumericArray.
[jira] [Created] (ARROW-12350) [MATLAB] Add examples/ directory to demonstrate workflows
Fiona La created ARROW-12350: Summary: [MATLAB] Add examples/ directory to demonstrate workflows Key: ARROW-12350 URL: https://issues.apache.org/jira/browse/ARROW-12350 Project: Apache Arrow Issue Type: Task Components: MATLAB Reporter: Fiona La Assignee: Fiona La Create an examples/ directory under matlab/ that contains MATLAB scripts to demonstrate workflows enabled by the MATLAB Interface for Apache Arrow.
[jira] [Created] (ARROW-12348) Add architecture doc illustrating design
Fiona La created ARROW-12348: Summary: Add architecture doc illustrating design Key: ARROW-12348 URL: https://issues.apache.org/jira/browse/ARROW-12348 Project: Apache Arrow Issue Type: Sub-task Reporter: Fiona La
[jira] [Created] (ARROW-12347) Add subclass: arrow.UInt64Array m-code
Fiona La created ARROW-12347: Summary: Add subclass: arrow.UInt64Array m-code Key: ARROW-12347 URL: https://issues.apache.org/jira/browse/ARROW-12347 Project: Apache Arrow Issue Type: Sub-task Reporter: Fiona La
[jira] [Created] (ARROW-12346) Add abstract function for null information
Fiona La created ARROW-12346: Summary: Add abstract function for null information Key: ARROW-12346 URL: https://issues.apache.org/jira/browse/ARROW-12346 Project: Apache Arrow Issue Type: Sub-task Reporter: Fiona La
[jira] [Created] (ARROW-12345) Add abstract function for querying size and type
Fiona La created ARROW-12345: Summary: Add abstract function for querying size and type Key: ARROW-12345 URL: https://issues.apache.org/jira/browse/ARROW-12345 Project: Apache Arrow Issue Type: Sub-task Reporter: Fiona La
[jira] [Created] (ARROW-12344) Add abstract function for display
Fiona La created ARROW-12344: Summary: Add abstract function for display Key: ARROW-12344 URL: https://issues.apache.org/jira/browse/ARROW-12344 Project: Apache Arrow Issue Type: Sub-task Reporter: Fiona La
[jira] [Created] (ARROW-12343) [Rust] Support auto-vectorization for min/max
Daniël Heres created ARROW-12343: Summary: [Rust] Support auto-vectorization for min/max Key: ARROW-12343 URL: https://issues.apache.org/jira/browse/ARROW-12343 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Daniël Heres Assignee: Daniël Heres
[jira] [Created] (ARROW-12342) [Packaging] Fix tabulation in crossbow templates for submitting nightly builds
Krisztian Szucs created ARROW-12342: --- Summary: [Packaging] Fix tabulation in crossbow templates for submitting nightly builds Key: ARROW-12342 URL: https://issues.apache.org/jira/browse/ARROW-12342 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 4.0.0 We upload gemfury artifacts from the nightly builds only checking arrow's branch we submit the builds against. The jinja macro produced wrong yml configurations.
[jira] [Created] (ARROW-12341) [C++] Get rid of Result>
Weston Pace created ARROW-12341: --- Summary: [C++] Get rid of Result> Key: ARROW-12341 URL: https://issues.apache.org/jira/browse/ARROW-12341 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Weston Pace Assignee: Weston Pace Fix For: 5.0.0 Prefer MakeFailingGenerator. This should simplify calling code and keep things to a single failure path.
[jira] [Created] (ARROW-12340) [Java] Avro to Arrow converter doesn't appear to generate valid arrow data
Micah Kornfield created ARROW-12340: --- Summary: [Java] Avro to Arrow converter doesn't appear to generate valid arrow data Key: ARROW-12340 URL: https://issues.apache.org/jira/browse/ARROW-12340 Project: Apache Arrow Issue Type: Bug Reporter: Micah Kornfield I think this is related to how Unions are handled (I had thought unions of null and one other type would be converted to the nullable type, but that is a separate issue). I haven't had time to fully diagnose, but remnants of the code I tried to use are at [https://gist.github.com/emkornfield/efd3a4c3c1012dc19cf9769198e3bffe]. That code and the CSV file from https://issues.apache.org/jira/browse/ARROW-11629?jql=text%20~%20%22arrow%20drill%20parquet%20dictionary%22 produce data that isn't readable by the C++ implementation.
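The expectation mentioned in parentheses above (a union of null and exactly one other type becoming a nullable value of that type) can be sketched as follows. This only illustrates the idea and is not the Java converter's actual logic; the helper name resolve_union is hypothetical, and the only Avro fact relied on is that unions are written as JSON arrays in a schema.

```python
from typing import Any, Tuple


def resolve_union(avro_type: Any) -> Tuple[Any, bool]:
    """Return (value_type, nullable) for an Avro schema type.

    Hypothetical helper for illustration only.
    """
    if isinstance(avro_type, list):  # Avro writes unions as JSON arrays
        non_null = [t for t in avro_type if t != "null"]
        has_null = len(non_null) < len(avro_type)
        if has_null and len(non_null) == 1:
            # ["null", T] (in either order) collapses to a nullable T.
            return non_null[0], True
    # Everything else is returned unchanged: plain types, and unions
    # that cannot be collapsed to a single nullable type.
    return avro_type, False


print(resolve_union(["null", "long"]))
print(resolve_union(["null", "long", "string"]))
print(resolve_union("string"))
```

In this sketch a union with more than one non-null branch stays a genuine union, which is the case the report suspects of producing unreadable data.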
[jira] [Created] (ARROW-12339) [Rust][DataFusion] COUNT DISTINCT does not support for `Boolean`
Andrew Lamb created ARROW-12339: --- Summary: [Rust][DataFusion] COUNT DISTINCT does not support for `Boolean` Key: ARROW-12339 URL: https://issues.apache.org/jira/browse/ARROW-12339 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb

If you try to run a `COUNT (DISTINCT ..)` query on a float column you get the following error:

thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', datafusion/src/scalar.rs:342:22

Reproducer:
{code}
echo "foo,1.23" > /tmp/foo.csv
./target/debug/datafusion-cli
> CREATE EXTERNAL TABLE t (a varchar, b float) STORED AS CSV LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(distinct a) from t;
+---+
| COUNT(DISTINCT a) |
+---+
| 1 |
+---+
1 rows in set. Query took 0 seconds.
> select count(distinct b) from t;
thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', datafusion/src/scalar.rs:342:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ArrowError(ExternalError(Canceled))
{code}
[jira] [Created] (ARROW-12338) [Python] Permission denied while accessing HDFS data
Suhas N M created ARROW-12338: - Summary: [Python] Permission denied while accessing HDFS data Key: ARROW-12338 URL: https://issues.apache.org/jira/browse/ARROW-12338 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 3.0.0 Reporter: Suhas N M

Hi, I have been trying to connect to an HDFS cluster using pyarrow version 3.0.0. The connection goes through, but I am unable to perform any operation involving the HDFS cluster. Here is the error thrown:

Traceback (most recent call last):
  File "pyarrow_test.py", line 8, in
    hdfs.create_dir('test3')
  File "pyarrow/_fs.pyx", line 450, in pyarrow._fs.FileSystem.create_dir
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS create directory failed, errno: 13 (Permission denied)

PS: I have checked the access permissions and they are correct. I am able to access the files and create directories with the 'hdfs' command. The Hadoop cluster is Kerberos enabled, and I used the following line to create the connection:

hdfs = fs.HadoopFileSystem('', 8020, user='', kerb_ticket='/tmp/krb5cc_500')
[jira] [Created] (ARROW-12337) add DoubleEndedIterator and ExactSizeIterator traits
Ritchie created ARROW-12337: --- Summary: add DoubleEndedIterator and ExactSizeIterator traits Key: ARROW-12337 URL: https://issues.apache.org/jira/browse/ARROW-12337 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Ritchie Assignee: Ritchie Make arrow array iterators implement DoubleEndedIterator and ExactSizeIterator traits
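For readers less familiar with the two Rust traits named above, here is a rough Python analogue of what they provide: consuming items from either end, and reporting exactly how many items remain. This is a sketch of the semantics only, not the Rust implementation; the class ArrayIter and its method next_back are hypothetical names chosen for illustration.

```python
from typing import Optional, Sequence


class ArrayIter:
    """Hypothetical iterator over a sequence of optional (nullable) values."""

    def __init__(self, values: Sequence[Optional[int]]) -> None:
        self._values = values
        self._front = 0           # next index to yield from the front
        self._back = len(values)  # one past the next index to yield from the back

    def __iter__(self) -> "ArrayIter":
        return self

    def __next__(self) -> Optional[int]:
        if self._front >= self._back:
            raise StopIteration
        item = self._values[self._front]
        self._front += 1
        return item

    def next_back(self) -> Optional[int]:
        """Rough analogue of DoubleEndedIterator::next_back."""
        if self._front >= self._back:
            raise StopIteration
        self._back -= 1
        return self._values[self._back]

    def __len__(self) -> int:
        """Rough analogue of ExactSizeIterator::len: items not yet yielded."""
        return self._back - self._front


it = ArrayIter([1, None, 3, 4])
first, last = next(it), it.next_back()
print(first, last, len(it))  # one item taken from each end, two remain
print(list(it))              # the remaining middle items, front to back
```

The two cursors meet in the middle, so an item is never yielded twice regardless of how front and back consumption are interleaved.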
[jira] [Created] (ARROW-12336) [C++][Python] Empty Int64 array is of wrong size
Thomas Blauth created ARROW-12336: - Summary: [C++][Python] Empty Int64 array is of wrong size Key: ARROW-12336 URL: https://issues.apache.org/jira/browse/ARROW-12336 Project: Apache Arrow Issue Type: Bug Components: C++, Python Environment: macOS 10.15.7 Arrow version: 3.1.0.dev578 Reporter: Thomas Blauth

Setup: Table with Int64 and str columns; generated using the dataset API; filtered on the str column.

Bug Description: Calling {{table.to_pandas()}} fails because of an empty array in the ChunkedArray of the Int64 column. This empty array has a size of 4 bytes when using the arrow nightly builds and 0 bytes when using arrow 3.0.0.

Note: The bug does not occur when the table only contains an Int64 column.

Minimal example:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet
import pyarrow.dataset

print("Arrow version: " + str(pa.__version__))
print("---")

# Only Int64 works fine
df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
table = pa.table(df)

path_0 = "./test_0.parquet"
pa.parquet.write_table(table, path_0)

schema = pa.parquet.read_schema(path_0)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_0],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))
print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()

print("---")

# Int64 and str crashes
df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
df = df.astype({"Int_col": "Int64"})
table = pa.table(df)

path_1 = "./test_1.parquet"
pa.parquet.write_table(table, path_1)

schema = pa.parquet.read_schema(path_1)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_1],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))
print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
{code}

Output:
{code:bash}
Arrow version: 3.1.0.dev578
---
Size of array: 0
---
Size of array: 4
Traceback (most recent call last):
  File "/Users/xxx/empty_array_buffer_size.py", line 47, in
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
    data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
ValueError: buffer size must be a multiple of element size
{code}
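The final ValueError in the report comes from reinterpreting a byte buffer as fixed-width elements: a 4-byte buffer cannot be viewed as 8-byte Int64 values, whereas a 0-byte buffer can (as zero elements). The sketch below shows that rule in isolation. It uses the standard library's array module as a stand-in for np.frombuffer, which is an assumption made so the example runs without NumPy; the helper name view_as_int64 is hypothetical and this is not the pandas or pyarrow code path.

```python
from array import array


def view_as_int64(buf: bytes) -> list:
    """Interpret raw bytes as signed 64-bit integers (hypothetical helper)."""
    values = array("q")  # 'q' is a signed 8-byte integer
    # frombytes() raises ValueError unless len(buf) is a multiple of itemsize.
    values.frombytes(buf)
    return values.tolist()


print(view_as_int64(bytes(0)))  # empty buffer: zero elements, no error
print(view_as_int64(bytes(8)))  # exactly one 8-byte element

try:
    view_as_int64(bytes(4))     # same size as the 4-byte buffer in the report
except ValueError as exc:
    print("rejected:", exc)
```

This is consistent with the observation in the report that a size of 0 bytes converts cleanly while a size of 4 bytes fails.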