[jira] [Created] (ARROW-12356) [Website] Update install page instructions to point to artifactory

2021-04-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-12356:
---

 Summary: [Website] Update install page instructions to point to 
artifactory
 Key: ARROW-12356
 URL: https://issues.apache.org/jira/browse/ARROW-12356
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 4.0.0


Looks like packages for old versions have been moved over, even if we can't 
upload new ones yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12355) [C++] Implement efficient async CSV scanning

2021-04-12 Thread Weston Pace (Jira)
Weston Pace created ARROW-12355:
---

 Summary: [C++] Implement efficient async CSV scanning
 Key: ARROW-12355
 URL: https://issues.apache.org/jira/browse/ARROW-12355
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


ARROW-12289 adds an inefficient but simple AsyncScanner implementation that 
does not rely on asynchronous readers.  This task is to implement the 
asynchronous scan operation properly for CSV.





[jira] [Created] (ARROW-12354) [Packaging][RPM] Use apache.jfrog.io/artifactory/ instead of apache.bintray.com/

2021-04-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-12354:


 Summary: [Packaging][RPM] Use apache.jfrog.io/artifactory/ instead 
of apache.bintray.com/
 Key: ARROW-12354
 URL: https://issues.apache.org/jira/browse/ARROW-12354
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-12353) [Packaging][deb] Rename -archive-keyring to -apt-source

2021-04-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-12353:


 Summary: [Packaging][deb] Rename -archive-keyring to -apt-source
 Key: ARROW-12353
 URL: https://issues.apache.org/jira/browse/ARROW-12353
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


lintian recommends that a package that installs files into
/etc/apt/sources.list.d/ use the -apt-source suffix.

See also: https://lintian.debian.net/tags/package-installs-apt-sources

This also changes the repository URL from https://apache.bintray.com/ to
https://apache.jfrog.io/artifactory/ .





[jira] [Created] (ARROW-12352) [CI][R][Windows] Remove needless workaround for MSYS2

2021-04-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-12352:


 Summary: [CI][R][Windows] Remove needless workaround for MSYS2
 Key: ARROW-12352
 URL: https://issues.apache.org/jira/browse/ARROW-12352
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


repo.msys2.org is alive, and sf.net is more fragile than repo.msys2.org.

See also ARROW-10202.





[jira] [Created] (ARROW-12351) [CI][Ruby] Use ruby/setup-ruby instead of actions/setup-ruby

2021-04-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-12351:


 Summary: [CI][Ruby] Use ruby/setup-ruby instead of 
actions/setup-ruby
 Key: ARROW-12351
 URL: https://issues.apache.org/jira/browse/ARROW-12351
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Because actions/setup-ruby is deprecated:

{quote}
Please note: This action is deprecated and should no longer be used. The team 
at GitHub has ceased making and accepting code contributions or maintaining 
issues tracker. Please, migrate your workflows to the ruby/setup-ruby, which is 
being actively maintained by the official Ruby organization.
{quote}

https://github.com/actions/setup-ruby#setup-ruby





[jira] [Created] (ARROW-12349) [MATLAB] add support for converting a MATLAB uint64 array to an arrow::NumericArray

2021-04-12 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12349:
-

 Summary: [MATLAB] add support for converting a MATLAB uint64 array 
to an arrow::NumericArray
 Key: ARROW-12349
 URL: https://issues.apache.org/jira/browse/ARROW-12349
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Sarah Gilmore


Create a C++ function that accepts a MATLAB uint64 array and converts it into an 
arrow::NumericArray.





[jira] [Created] (ARROW-12350) [MATLAB] Add examples/ directory to demonstrate workflows

2021-04-12 Thread Fiona La (Jira)
Fiona La created ARROW-12350:


 Summary: [MATLAB] Add examples/ directory to demonstrate workflows
 Key: ARROW-12350
 URL: https://issues.apache.org/jira/browse/ARROW-12350
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Fiona La
Assignee: Fiona La


Create an examples/ directory under matlab/ that contains MATLAB scripts to 
demonstrate workflows enabled by the MATLAB Interface for Apache Arrow. 





[jira] [Created] (ARROW-12348) Add architecture doc illustrating design

2021-04-12 Thread Fiona La (Jira)
Fiona La created ARROW-12348:


 Summary: Add architecture doc illustrating design
 Key: ARROW-12348
 URL: https://issues.apache.org/jira/browse/ARROW-12348
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Fiona La








[jira] [Created] (ARROW-12347) Add subclass: arrow.UInt64Array m-code

2021-04-12 Thread Fiona La (Jira)
Fiona La created ARROW-12347:


 Summary: Add subclass: arrow.UInt64Array m-code
 Key: ARROW-12347
 URL: https://issues.apache.org/jira/browse/ARROW-12347
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Fiona La








[jira] [Created] (ARROW-12346) Add abstract function for null information

2021-04-12 Thread Fiona La (Jira)
Fiona La created ARROW-12346:


 Summary: Add abstract function for null information
 Key: ARROW-12346
 URL: https://issues.apache.org/jira/browse/ARROW-12346
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Fiona La








[jira] [Created] (ARROW-12345) Add abstract function for querying size and type

2021-04-12 Thread Fiona La (Jira)
Fiona La created ARROW-12345:


 Summary: Add abstract function for querying size and type
 Key: ARROW-12345
 URL: https://issues.apache.org/jira/browse/ARROW-12345
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Fiona La








[jira] [Created] (ARROW-12344) Add abstract function for display

2021-04-12 Thread Fiona La (Jira)
Fiona La created ARROW-12344:


 Summary: Add abstract function for display
 Key: ARROW-12344
 URL: https://issues.apache.org/jira/browse/ARROW-12344
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Fiona La








[jira] [Created] (ARROW-12343) [Rust] Support auto-vectorization for min/max

2021-04-12 Thread Jira
Daniël Heres created ARROW-12343:


 Summary: [Rust] Support auto-vectorization for min/max
 Key: ARROW-12343
 URL: https://issues.apache.org/jira/browse/ARROW-12343
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres








[jira] [Created] (ARROW-12342) [Packaging] Fix tabulation in crossbow templates for submitting nightly builds

2021-04-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-12342:
---

 Summary: [Packaging] Fix tabulation in crossbow templates for 
submitting nightly builds
 Key: ARROW-12342
 URL: https://issues.apache.org/jira/browse/ARROW-12342
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 4.0.0


We upload Gemfury artifacts from the nightly builds, checking only the Arrow 
branch we submit the builds against. The Jinja macro produced incorrect YAML 
configurations.





[jira] [Created] (ARROW-12341) [C++] Get rid of Result>

2021-04-12 Thread Weston Pace (Jira)
Weston Pace created ARROW-12341:
---

 Summary: [C++] Get rid of Result>
 Key: ARROW-12341
 URL: https://issues.apache.org/jira/browse/ARROW-12341
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace
 Fix For: 5.0.0


Prefer MakeFailingGenerator.  This should simplify calling code and keep things 
to a single failure path.
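
A rough sketch of the idea, written in Python rather than C++ and under 
assumptions: make_failing_generator, open_rows, and collect are hypothetical 
names used only for illustration, not Arrow APIs. The factory always returns a 
generator, and any setup error surfaces on the first pull, so callers handle 
failure in one place.

```python
import asyncio


def make_failing_generator(error):
    """Hypothetical analogue: an async generator whose first pull raises `error`."""
    async def gen():
        raise error
        yield  # unreachable; its presence makes `gen` an async generator

    return gen()


def open_rows(path):
    """Always return an async generator; never a separate error value."""
    if not path.endswith(".csv"):
        # Setup failure is deferred into the generator itself.
        return make_failing_generator(ValueError(f"unsupported file: {path}"))

    async def rows():
        for i in range(3):
            yield i

    return rows()


async def collect(gen):
    """Single failure path: errors only ever appear while iterating."""
    try:
        return [item async for item in gen]
    except ValueError as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    print(asyncio.run(collect(open_rows("data.csv"))))
    print(asyncio.run(collect(open_rows("data.txt"))))
```

The caller never has to check a wrapper result before iterating; it only needs 
the error handling it already has around the iteration.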





[jira] [Created] (ARROW-12340) [Java] Avro to Arrow converter doesn't appear to generate valid arrow data

2021-04-12 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-12340:
---

 Summary: [Java] Avro to Arrow converter doesn't appear to generate 
valid arrow data
 Key: ARROW-12340
 URL: https://issues.apache.org/jira/browse/ARROW-12340
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Micah Kornfield


I think this is related to how Unions are handled (I had thought unions of 
null and one other type would be converted to the nullable type, but that is a 
separate issue).

 

I haven't had time to fully diagnose, but remnants of the code I tried to use 
are at [https://gist.github.com/emkornfield/efd3a4c3c1012dc19cf9769198e3bffe]

 

And the CSV file from 
https://issues.apache.org/jira/browse/ARROW-11629?jql=text%20~%20%22arrow%20drill%20parquet%20dictionary%22

 

produce data that isn't readable by the C++ implementation.





[jira] [Created] (ARROW-12339) [Rust][DataFusion] COUNT DISTINCT does not support for `Boolean`

2021-04-12 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-12339:
---

 Summary: [Rust][DataFusion] COUNT DISTINCT does not support for 
`Boolean`
 Key: ARROW-12339
 URL: https://issues.apache.org/jira/browse/ARROW-12339
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andrew Lamb


If you try to run a `COUNT (DISTINCT ..)` query on a float column you get the 
following error:

thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', 
datafusion/src/scalar.rs:342:22

Reproducer:
{code}
 echo "foo,1.23" > /tmp/foo.csv
 ./target/debug/datafusion-cli

> CREATE EXTERNAL TABLE t (a varchar, b float) STORED AS CSV LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(distinct a) from t;
+-------------------+
| COUNT(DISTINCT a) |
+-------------------+
| 1                 |
+-------------------+
1 rows in set. Query took 0 seconds.
> select count(distinct b) from t;
thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', 
datafusion/src/scalar.rs:342:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ArrowError(ExternalError(Canceled))
{code}





[jira] [Created] (ARROW-12338) [Python] Permission denied while accessing HDFS data

2021-04-12 Thread Suhas N M (Jira)
Suhas N M created ARROW-12338:
-

 Summary: [Python] Permission denied while accessing HDFS data
 Key: ARROW-12338
 URL: https://issues.apache.org/jira/browse/ARROW-12338
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
Reporter: Suhas N M


Hi, I have been trying to connect to an HDFS cluster using pyarrow version 3.0.0. 
The connection goes through, but I am unable to perform any operation involving 
the HDFS cluster. Here is the error thrown:

Traceback (most recent call last):
  File "pyarrow_test.py", line 8, in <module>
    hdfs.create_dir('test3')
  File "pyarrow/_fs.pyx", line 450, in pyarrow._fs.FileSystem.create_dir
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS create directory failed, errno: 13 (Permission denied)

PS: I have checked the access permissions and they are correct. I am able to 
access the files and create directories with the 'hdfs' command. 
The Hadoop cluster is Kerberos-enabled. I used the following line to create the 
connection:
hdfs = fs.HadoopFileSystem('', 8020, user='', kerb_ticket='/tmp/krb5cc_500')









[jira] [Created] (ARROW-12337) add DoubleEndedIterator and ExactSizeIterator traits

2021-04-12 Thread Ritchie (Jira)
Ritchie created ARROW-12337:
---

 Summary: add DoubleEndedIterator and ExactSizeIterator traits
 Key: ARROW-12337
 URL: https://issues.apache.org/jira/browse/ARROW-12337
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Ritchie
Assignee: Ritchie


Make arrow array iterators implement the DoubleEndedIterator and 
ExactSizeIterator traits.





[jira] [Created] (ARROW-12336) [C++][Python] Empty Int64 array is of wrong size

2021-04-12 Thread Thomas Blauth (Jira)
Thomas Blauth created ARROW-12336:
-

 Summary: [C++][Python] Empty Int64 array is of wrong size
 Key: ARROW-12336
 URL: https://issues.apache.org/jira/browse/ARROW-12336
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
 Environment: macOS 10.15.7
Arrow version: 3.1.0.dev578
Reporter: Thomas Blauth


Setup:

Table with Int64 and str columns; generated using the dataset api; filtered on 
str column.

 

Bug Description:

Calling {{table.to_pandas()}} fails due to an empty array in the ChunkedArray 
of the Int64 column. This empty array has a size of 4 bytes when using the Arrow 
nightly builds and 0 bytes when using Arrow 3.0.0.

Note: The bug does not occur when the table only contains an Int64 column.

 

Minimal example:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet
import pyarrow.dataset

print("Arrow version: " + str(pa.__version__))
print("---")

# Only Int64 works fine
df = pd.DataFrame({"Int_col": [1, 2, 10]}, dtype="Int64")
table = pa.table(df)
path_0 = "./test_0.parquet"
pa.parquet.write_table(table, path_0)

schema = pa.parquet.read_schema(path_0)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_0],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("Int_col") == 3))

print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
print("---")


# Int64 and str crashes
df = pd.DataFrame({"Int_col": [1, 2, 10], "str_col": ["A", "B", "Z"]})
df = df.astype({"Int_col": "Int64"})
table = pa.table(df)
path_1 = "./test_1.parquet"
pa.parquet.write_table(table, path_1)

schema = pa.parquet.read_schema(path_1)
ds = pa.dataset.FileSystemDataset.from_paths(
    paths=[path_1],
    filesystem=pa.fs.LocalFileSystem(),
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
)
table = ds.to_table(filter=(pa.dataset.field("str_col") == "C"))

print("Size of array: " + str(table.column(0).nbytes))
df = table.to_pandas()
{code}
 

Output :
{code:bash}
Arrow version: 3.1.0.dev578
---
Size of array: 0
---
Size of array: 4
Traceback (most recent call last):
  File "/Users/xxx/empty_array_buffer_size.py", line 47, in <module>
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 756, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1740, in pyarrow.lib.Table._to_pandas
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 794, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1135, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 753, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/integer.py", line 117, in __from_arrow__
    data, mask = pyarrow_array_to_numpy_and_mask(arr, dtype=self.type)
  File "/usr/local/mambaforge/envs/pa_nightly/lib/python3.9/site-packages/pandas/core/arrays/_arrow_utils.py", line 32, in pyarrow_array_to_numpy_and_mask
    data = np.frombuffer(buflist[1], dtype=dtype)[arr.offset : arr.offset + len(arr)]
ValueError: buffer size must be a multiple of element size
{code}
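
The final ValueError appears to come from a size check: a buffer can only be 
viewed as 8-byte integers if its length is a multiple of 8, so a 0-byte buffer 
is fine while a 4-byte one is not. Below is a minimal sketch of that check. As 
an assumption for illustration, it uses the standard-library array module as a 
stand-in for np.frombuffer; view_as_int64 and try_view are hypothetical helper 
names, not part of pyarrow or pandas.

{code:python}
```python
from array import array

# Item size of a signed 64-bit integer ("q" typecode), 8 bytes.
INT64_ITEMSIZE = array("q").itemsize


def view_as_int64(raw: bytes) -> list:
    """Interpret raw bytes as int64 values.

    Raises ValueError when len(raw) is not a multiple of the item size,
    which mirrors the kind of check that rejects the 4-byte buffer above.
    """
    values = array("q")
    values.frombytes(raw)
    return values.tolist()


def try_view(raw: bytes):
    """Return the decoded values, or the string 'ValueError' on a size mismatch."""
    try:
        return view_as_int64(raw)
    except ValueError:
        return "ValueError"


if __name__ == "__main__":
    print(try_view(bytes(0)))  # empty buffer: size 0 is a multiple of 8
    print(try_view(bytes(4)))  # 4-byte buffer: not a multiple of 8
    print(try_view(bytes(8)))  # one zero-valued int64
```
{code}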


