[jira] [Commented] (ARROW-10640) [C++] A "where" kernel to combine two arrays based on a mask

2020-11-23 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237243#comment-17237243
 ] 

Joris Van den Bossche commented on ARROW-10640:
---

And in SQL, this is related to a "CASE WHEN" clause: 
https://www.postgresqltutorial.com/postgresql-case/ (although there you can 
provide multiple boolean conditions, and it uses expressions instead of 
actual boolean masks; but that's something to optimize only once there is a 
query engine).
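
For illustration, a minimal sketch of the mask-based selection the proposed 
kernel would provide, using NumPy's {{np.where}} (referenced in the issue 
below) as the analogue:

{code:python}
import numpy as np

mask = np.array([True, False, True])
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Take from `a` where the mask is True, otherwise from `b`.
print(np.where(mask, a, b))  # -> [ 1 20  3]
{code}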

> [C++] A "where" kernel to combine two arrays based on a mask
> 
>
> Key: ARROW-10640
> URL: https://issues.apache.org/jira/browse/ARROW-10640
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> (from discussion in ARROW-9489 with [~maartenbreddels])
> A general "where" kernel like {{np.where}} 
> (https://numpy.org/doc/stable/reference/generated/numpy.where.html) seems a 
> generally useful kernel to have, and could also help mimic some other 
> Python (setitem-like) operations. 
> The concrete use case in ARROW-9489 is to basically do a 
> {{fill_null(array[string], array[string])}} which could be expressed as 
> {{where(is_null(arr), arr2, arr)}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-23 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237262#comment-17237262
 ] 

Joris Van den Bossche commented on ARROW-10517:
---

[~ldacey] thanks for the feedback. I think it is certainly reasonable to gather 
the many small files into a single larger file at certain intervals. 
Now, to be clear, there is still an alternative to {{partition_filename_cb}} in 
the new {{write_dataset}} function: the {{basename_template}} keyword (see 
https://github.com/apache/arrow/blob/6cea669a0a7fb836a555f3d87177b2517543ddb5/python/pyarrow/dataset.py#L713-L717
 for the docstring). 

So the new keyword is no longer a callback _function_, but it should still let 
you specify the date as the base name for the written file, I think (so the 
question for you is whether this new keyword also allows you to achieve what 
you want). 
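
To illustrate, a minimal sketch (with a hypothetical file name template, and 
{{table}}/{{fs}} as in your snippets) of passing a date through the new 
keyword:

{code:python}
import pyarrow.dataset as ds

# basename_template is a plain string, not a callback; "{i}" is replaced
# with an automatically incremented integer for each written file.
ds.write_dataset(
    data=table,
    base_dir="dev/test7",
    basename_template="2020-11-23-part-{i}.parquet",
    format="parquet",
    filesystem=fs,
)
{code}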

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Fix For: 2.0.0
>
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to how early versions of adlfs implement mkdir(). That 
> said, I use write_to_dataset and write_table all the time, so I am not sure 
> why this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="

[jira] [Created] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-10694:
---

 Summary: [Python] ds.write_dataset() generates empty files for 
each final partition
 Key: ARROW-10694
 URL: https://issues.apache.org/jira/browse/ARROW-10694
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: Ubuntu 18.04
Python 3.8.6
adlfs master branch
Reporter: Lance Dacey


ds.write_dataset() is generating empty files for the final partition folder, 
which causes errors when reading the dataset or converting a dataset to a table.

I believe this may be caused by fs.mkdir(). Without the final slash in the 
path, an empty file is created in the "dev" container:

 
{code:python}
fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
account_key=base.password)
fs.mkdir("dev/test2")
{code}
 

If the final slash is added, a proper folder is created:
{code:python}
fs.mkdir("dev/test2/")
{code}
 

Here is a full example of what happens with ds.write_dataset:
{code:python}
schema = pa.schema(
[
("year", pa.int16()),
("month", pa.int8()),
("day", pa.int8()),
("report_date", pa.date32()),
("employee_id", pa.string()),
("designation", pa.dictionary(index_type=pa.int16(), 
value_type=pa.string())),
]
)

part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
pa.int8()), ("day", pa.int8())]))

ds.write_dataset(data=table, 
 base_dir="dev/test-dataset", 
 basename_template="test-{i}.parquet", 
 format="parquet",
 partitioning=part, 
 schema=schema,
 filesystem=fs)

dataset.files

#sample printed below, note the empty files
[
 'dev/test-dataset/2018/1/1/test-0.parquet',
 'dev/test-dataset/2018/10/1',
 'dev/test-dataset/2018/10/1/test-27.parquet',
 'dev/test-dataset/2018/3/1',
 'dev/test-dataset/2018/3/1/test-6.parquet',
 'dev/test-dataset/2020/1/1',
 'dev/test-dataset/2020/1/1/test-2.parquet',
 'dev/test-dataset/2020/10/1',
 'dev/test-dataset/2020/10/1/test-29.parquet',
 'dev/test-dataset/2020/11/1',
 'dev/test-dataset/2020/11/1/test-32.parquet',
 'dev/test-dataset/2020/2/1',
 'dev/test-dataset/2020/2/1/test-5.parquet',
 'dev/test-dataset/2020/7/1',
 'dev/test-dataset/2020/7/1/test-20.parquet',
 'dev/test-dataset/2020/8/1',
 'dev/test-dataset/2020/8/1/test-23.parquet',
 'dev/test-dataset/2020/9/1',
 'dev/test-dataset/2020/9/1/test-26.parquet'
]{code}
As you can see, there is an empty file for each "day" partition. I was not even 
able to read the dataset at all until I manually deleted the first empty file 
in the dataset (2018/1/1).

I then get an error when I try to use the to_table() method:
{code}
OSError   Traceback (most recent call last)
----> 1 dataset.to_table()

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
{code}
If I manually delete the empty file, I can then use the to_table() function:
{code:python}
dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
10)).to_pandas()
{code}
Is this a bug with pyarrow, adlfs, or fsspec?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2020-11-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10695:
-

 Summary: [C++][Dataset] Allow to use a UUID in the 
basename_template when writing a dataset
 Key: ARROW-10695
 URL: https://issues.apache.org/jira/browse/ARROW-10695
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently we allow the user to specify a {{basename_template}}, which can 
include an {{"\{i\}"}} placeholder that is replaced with an automatically 
incremented integer (so each generated file written to a single partition is 
unique):

https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717

It _might_ be useful to also have the ability to use a UUID, to ensure the file 
is unique in general (not only for a single write) and to mimic the behaviour 
of the old {{write_to_dataset}} implementation.

For example, we could look for a {{"\{uuid\}"}} in the template string, and if 
present replace it for each file with a new UUID.
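
A hypothetical sketch of the proposed substitution (not implemented at the 
time of writing), using Python's {{uuid}} module:

{code:python}
import uuid

# The template could carry both placeholders: "{i}" keeps per-write
# uniqueness, while "{uuid}" would make names globally unique across writes.
basename_template = "part-{i}-{uuid}.parquet"
name = basename_template.replace("{i}", "0").replace("{uuid}", uuid.uuid4().hex)
print(name)  # e.g. 'part-0-3f2c...parquet'
{code}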



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237284#comment-17237284
 ] 

Joris Van den Bossche commented on ARROW-10694:
---

[~ldacey] specifically for the reading part, there is an option to exclude 
invalid files in the {{ds.dataset(..)}} function, by specifying 
{{exclude_invalid_files=True}} (the docs seem to incorrectly indicate that the 
default is True; I think it is actually False). 

Now, of course, that's only a workaround, as I fully agree those empty files 
shouldn't be created in the first place. 
As you mention, this seems to be the behaviour of {{fs.mkdir()}}, so I think we 
should rather discuss this in the {{adlfs}} project.

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237299#comment-17237299
 ] 

Lance Dacey commented on ARROW-10694:
-

Sure. https://github.com/dask/adlfs/issues/137

I tried the exclude_invalid_files argument but ran into an error:

 
{code}
dataset = ds.dataset(source="dev/test-dataset", 
 format="parquet", 
 partitioning=partition,
 exclude_invalid_files=True,
 filesystem=fs)

---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 dataset = ds.dataset(source="dev/test-dataset", 
  2  format="parquet",
  3  partitioning=partition,
  4  exclude_invalid_files=True,
  5  filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
434 selector_ignore_prefixes=selector_ignore_prefixes
435 )
--> 436 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, 
options)
437 
438 return factory.finish(schema)

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset.FileSystemDatasetFactory.__init__()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_open_input_file()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in open_input_file(self, 
path)
274 
275 if not self.fs.isfile(path):
--> 276 raise FileNotFoundError(path)
277 
278 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: dev/test-dataset/2018/1/1
{code}
That folder and the empty file exist, though:
{code:python}
for file in fs.find("dev/test-dataset"):
    print(file)

dev/test-dataset/2018/1/1
dev/test-dataset/2018/1/1/test-0.parquet
dev/test-dataset/2018/10/1
dev/test-dataset/2018/10/1/test-27.parquet
dev/test-dataset/2018/11/1
dev/test-dataset/2018/11/1/test-30.parquet
dev/test-dataset/2018/12/1
dev/test-dataset/2018/12/1/test-33.parquet
dev/test-dataset/2018/2/1
dev/test-dataset/2018/2/1/test-3.parquet

{code}
 

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major

[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237304#comment-17237304
 ] 

Joris Van den Bossche commented on ARROW-10694:
---

Can you check what {{fs.isfile("dev/test-dataset/2018/1/1")}} gives? And 
{{fs.info("dev/test-dataset/2018/1/1", detail=True)}}?

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237316#comment-17237316
 ] 

Lance Dacey commented on ARROW-10694:
-

{code:python}
print(fs.isfile("dev/test-dataset/2018/1/1"))
# False
print(fs.info("dev/test-dataset/2018/1/1", detail=True))
# {'name': 'dev/test-dataset/2018/1/1/', 'size': 0, 'type': 'directory'}
{code}

 

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237341#comment-17237341
 ] 

Lance Dacey commented on ARROW-10694:
-

FYI, I tested HivePartitioning as well, but faced the same issue. 

 
{code:python}
from pyarrow.dataset import HivePartitioning 

partition = HivePartitioning(pa.schema([("year", pa.int16()), ("month", 
pa.int8()), ("day", pa.int8())]))

FileNotFoundError: dev/test-dataset2/year=2018/month=1/day=1{code}

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10143) [C++] ArrayRangeEquals should accept EqualOptions

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10143.

Resolution: Fixed

Issue resolved by pull request 8703
[https://github.com/apache/arrow/pull/8703]

> [C++] ArrayRangeEquals should accept EqualOptions
> -
>
> Key: ARROW-10143
> URL: https://issues.apache.org/jira/browse/ARROW-10143
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Besides, the underlying implementations of ArrayEquals and ArrayRangeEquals 
> should be shared (right now they are duplicated).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10696) [C++] Investigate a bit run reader that would only return runs of set bits

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10696:
--

 Summary: [C++] Investigate a bit run reader that would only return 
runs of set bits
 Key: ARROW-10696
 URL: https://issues.apache.org/jira/browse/ARROW-10696
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Followup to PR discussion: 
https://github.com/apache/arrow/pull/8703#discussion_r526263665
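
A Python sketch of the idea (illustrative only; the real reader would operate 
word-at-a-time on the validity bitmap in C++): yield {{(offset, length)}} 
pairs for runs of set bits, skipping unset runs entirely:

{code:python}
def set_bit_runs(bits):
    """Yield (offset, length) for each run of set bits."""
    run_start = None
    for i, bit in enumerate(bits):
        if bit and run_start is None:
            run_start = i                      # a run of set bits begins
        elif not bit and run_start is not None:
            yield (run_start, i - run_start)   # the run just ended
            run_start = None
    if run_start is not None:
        yield (run_start, len(bits) - run_start)

print(list(set_bit_runs([1, 1, 0, 0, 1, 0, 1, 1, 1])))
# -> [(0, 2), (4, 1), (6, 3)]
{code}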



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10697) [C++] Consolidate bitmap word readers

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10697:
--

 Summary: [C++] Consolidate bitmap word readers
 Key: ARROW-10697
 URL: https://issues.apache.org/jira/browse/ARROW-10697
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


We currently have {{BitmapWordReader}}, {{BitmapUInt64Reader}} and 
{{Bitmap::VisitWords}}.

We should try to consolidate those, assuming benchmarks don't regress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10698) [C++] Optimize union equality comparison

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10698:
--

 Summary: [C++] Optimize union equality comparison
 Key: ARROW-10698
 URL: https://issues.apache.org/jira/browse/ARROW-10698
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


Currently, union array comparison in {{ArrayRangeEqual}} computes child 
equality over single union elements. This adds a large per-element comparison 
overhead. At least for sparse unions, it may be beneficial to detect contiguous 
runs of child ids and run child comparisons on entire runs.
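
An illustrative sketch (in Python, not the C++ implementation) of the 
suggested run detection: group the sparse union's type ids into contiguous 
runs so that each child comparison covers a whole run instead of one element:

{code:python}
def child_id_runs(type_ids):
    """Return (child_id, offset, length) for each contiguous run of type ids."""
    runs = []
    start = 0
    for i in range(1, len(type_ids) + 1):
        if i == len(type_ids) or type_ids[i] != type_ids[start]:
            runs.append((type_ids[start], start, i - start))
            start = i
    return runs

print(child_id_runs([0, 0, 1, 1, 1, 0]))
# -> [(0, 0, 2), (1, 2, 3), (0, 5, 1)]
{code}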



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10699:
--

 Summary: [C++] BitmapUInt64Reader doesn't work on big-endian
 Key: ARROW-10699
 URL: https://issues.apache.org/jira/browse/ARROW-10699
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to fail).
https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534
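
For illustration (an assumption about the failure mode, based on the usual 
class of endianness bug): Arrow validity bitmaps are byte sequences with 
least-significant-bit numbering, so loading eight bitmap bytes as a 
native-endian 64-bit word puts bits at different positions on a big-endian 
machine:

{code:python}
import struct

buf = bytes([0b00000001] + [0] * 7)  # bit 0 set in the bitmap
le = struct.unpack("<Q", buf)[0]     # little-endian load: value 1 (bit 0)
be = struct.unpack(">Q", buf)[0]     # big-endian load: bit 56 appears set
print(le, be)  # -> 1 72057594037927936
{code}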





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237367#comment-17237367
 ] 

Antoine Pitrou commented on ARROW-10699:


[~kiszk] I can try to fix it, unless you want to do it.

> [C++] BitmapUInt64Reader doesn't work on big-endian
> ---
>
> Key: ARROW-10699
> URL: https://issues.apache.org/jira/browse/ARROW-10699
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to 
> fail).
> https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10426) [C++] Arrow type large_string cannot be written to Parquet type column descriptor

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10426.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8632
[https://github.com/apache/arrow/pull/8632]

> [C++] Arrow type large_string cannot be written to Parquet type column 
> descriptor
> -
>
> Key: ARROW-10426
> URL: https://issues.apache.org/jira/browse/ARROW-10426
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 2.0.0
> Environment: R 4.0.3 on OSX 10.15.7
>Reporter: Gabriel Bassett
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: parquet, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When trying to write a dataset in parquet format, arrow errors with the 
> message: "Arrow type large_string cannot be written to Parquet type column 
> descriptor"
> {code:java}
> arrow::write_dataset(
>  dataframe,
>  "/directory/",
>  "parquet",
>  "partitioning" = c("col1", "col2")
> )
> {code}
> The dataframe in question is very large with one column containing the text 
> of message board posts encoded in HTML.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10700) [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10700:
--

 Summary: [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC
 Key: ARROW-10700
 URL: https://issues.apache.org/jira/browse/ARROW-10700
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


Seen on Github Actions:
https://github.com/apache/arrow/pull/8716/checks?check_run_id=1442252599#step:7:792

{code}
  Generating Code...
  level_comparison_avx2.cc
cl : command line warning D9002: ignoring unknown option '-mbmi2' 
[D:\a\arrow\arrow\build\cpp\src\parquet\parquet_shared.vcxproj]
  level_conversion_bmi2.cc
{code}

This may affect performance as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10700) [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC

2020-11-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237399#comment-17237399
 ] 

Antoine Pitrou commented on ARROW-10700:


cc [~emkornfield]

> [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC
> 
>
> Key: ARROW-10700
> URL: https://issues.apache.org/jira/browse/ARROW-10700
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
>
> Seen on Github Actions:
> https://github.com/apache/arrow/pull/8716/checks?check_run_id=1442252599#step:7:792
> {code}
>   Generating Code...
>   level_comparison_avx2.cc
> cl : command line warning D9002: ignoring unknown option '-mbmi2' 
> [D:\a\arrow\arrow\build\cpp\src\parquet\parquet_shared.vcxproj]
>   level_conversion_bmi2.cc
> {code}
> This may affect performance as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-10699:
--

Assignee: Antoine Pitrou

> [C++] BitmapUInt64Reader doesn't work on big-endian
> ---
>
> Key: ARROW-10699
> URL: https://issues.apache.org/jira/browse/ARROW-10699
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to 
> fail).
> https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10699:
---
Labels: pull-request-available  (was: )

> [C++] BitmapUInt64Reader doesn't work on big-endian
> ---
>
> Key: ARROW-10699
> URL: https://issues.apache.org/jira/browse/ARROW-10699
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to 
> fail).
> https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10701) [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression

2020-11-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann reassigned ARROW-10701:
--

Assignee: Jörn Horstmann

> [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by 
> clause specifies column index instead of expression
> -
>
> Key: ARROW-10701
> URL: https://issues.apache.org/jira/browse/ARROW-10701
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>
> I probably introduced this bug some time ago, but there was another bug in 
> the benchmark setup that caused the query to not be executed, only planned.
> Datafusion should probably also support queries like
> SELECT foo, bar
>   FROM table
>  ORDER BY 1, 2
> But for now the easiest fix for the benchmark would be to specify the column 
> name instead of the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10701) [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression

2020-11-23 Thread Jira
Jörn Horstmann created ARROW-10701:
--

 Summary: [Rust] [Datafusion] Benchmark sort_limit_query_sql fails 
because order by clause specifies column index instead of expression
 Key: ARROW-10701
 URL: https://issues.apache.org/jira/browse/ARROW-10701
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jörn Horstmann


I probably introduced this bug some time ago, but there was another bug in the 
benchmark setup that caused the query to not be executed, only planned.

Datafusion should probably also support queries like

{code}
SELECT foo, bar
  FROM table
 ORDER BY 1, 2
{code}

But for now the easiest fix for the benchmark would be to specify the column 
name instead of the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10646) [C++][FlightRPC] Disable flaky test

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10646.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8707
[https://github.com/apache/arrow/pull/8707]

> [C++][FlightRPC] Disable flaky test
> ---
>
> Key: ARROW-10646
> URL: https://issues.apache.org/jira/browse/ARROW-10646
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> One of the Flight tests is flaky on Windows, as it appears gRPC doesn't 
> always return us the address of the connected client. We can just disable 
> this part of the test since it's really testing gRPC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10701) [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression

2020-11-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann updated ARROW-10701:
---
Component/s: Rust - DataFusion
 Rust

> [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by 
> clause specifies column index instead of expression
> -
>
> Key: ARROW-10701
> URL: https://issues.apache.org/jira/browse/ARROW-10701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>
> I probably introduced this bug some time ago, but there was another bug in 
> the benchmark setup that caused the query to not be executed, only planned.
> Datafusion should probably also support queries like
> SELECT foo, bar
>   FROM table
>  ORDER BY 1, 2
> But for now the easiest fix for the benchmark would be to specify the column 
> name instead of the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10032) [Documentation] C++ Windows docs are out of date

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10032.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8706
[https://github.com/apache/arrow/pull/8706]

> [Documentation] C++ Windows docs are out of date
> 
>
> Key: ARROW-10032
> URL: https://issues.apache.org/jira/browse/ARROW-10032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> "Replicating AppVeyor Builds" needs the following changes: 
> [https://arrow.apache.org/docs/developers/cpp/windows.html#replicating-appveyor-builds]
>  * The recommended VM does not include the C++ compiler - we should link to 
> the build tools and describe which of them needs installation
>  * Boost: the b2 script now requires --with not -with flags
>  * The batch scripts were renamed (appveyor-cpp-build/appveyor-cpp-setup)
>  * Prefer JOB=Build_Debug as otherwise it forces clcache
>  * BOOST_INCLUDEDIR must be set to C:\Boost\include\boost_VERSION
>  * Use conda manually to install gtest gflags ninja rapidjson grpc-cpp 
> protobuf
> Even with this:
>  * The developer prompt can't find cl.exe (the compiler). (You must restart 
> the VM!)
>  * The PowerShell prompt can't use conda (it complains a config file isn't 
> signed)
>  Solution: run a PowerShell instance as administrator and run 
> "Set-ExecutionPolicy -ExecutionPolicy Unrestricted"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10701) [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10701:
---
Labels: pull-request-available  (was: )

> [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by 
> clause specifies column index instead of expression
> -
>
> Key: ARROW-10701
> URL: https://issues.apache.org/jira/browse/ARROW-10701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I probably introduced this bug some time ago, but there was another bug in 
> the benchmark setup that caused the query to not be executed, only planned.
> Datafusion should probably also support queries like
> SELECT foo, bar
>   FROM table
>  ORDER BY 1, 2
> But for now the easiest fix for the benchmark would be to specify the column 
> name instead of the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-10699.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8744
[https://github.com/apache/arrow/pull/8744]

> [C++] BitmapUInt64Reader doesn't work on big-endian
> ---
>
> Key: ARROW-10699
> URL: https://issues.apache.org/jira/browse/ARROW-10699
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to 
> fail).
> https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10610) [C++] arrow-utility-test and arrow-csv-test causes failures on a big-endian platform

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-10610:
--

Assignee: Antoine Pitrou  (was: Kazuaki Ishizaki)

> [C++] arrow-utility-test and arrow-csv-test causes failures on a big-endian 
> platform
> 
>
> Key: ARROW-10610
> URL: https://issues.apache.org/jira/browse/ARROW-10610
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> After [https://github.com/apache/arrow/pull/8494] has been merged, the 
> following tests, which use {{1eN}} format, cause failures.
> {code}
> [ RUN  ] FloatingPointConversion.Basics
> /arrow/cpp/src/arrow/testing/gtest_util.cc:128: Failure
> Failed
> @@ -1, +1 @@
> --1e+30
> +0
> [  FAILED  ] FloatingPointConversion.Basics (3 ms)
> ...
> [ RUN  ] StringConversion.ToFloat
> /arrow/cpp/src/arrow/util/value_parsing_test.cc:35: Failure
> Expected equality of these values:
>   out
> Which is: 0
>   expected
> Which is: -1e+20
> Conversion failed for '-1e20'
> [  FAILED  ] StringConversion.ToFloat (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10688) [Rust] [DataFusion] Support CASE WHEN from DataFrame API

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10688:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Support CASE WHEN from DataFrame API
> 
>
> Key: ARROW-10688
> URL: https://issues.apache.org/jira/browse/ARROW-10688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Support CASE WHEN from DataFrame API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10702) [C++] Micro-optimize integer parsing

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10702:
--

 Summary: [C++] Micro-optimize integer parsing
 Key: ARROW-10702
 URL: https://issues.apache.org/jira/browse/ARROW-10702
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


It might be possible to optimize integer and decimal parsing using the 
following tricks from the {{fast_float}} library:
https://github.com/lemire/fast_float/blob/70c9b7f884c7f80a9a0e06fa9754c0a2e6a9492e/include/fast_float/ascii_number.h#L18-L38
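
To make the linked trick concrete, here is an illustrative Python 
transcription (the real code is C++; the explicit 64-bit masking stands in 
for unsigned wraparound) of fast_float's eight-digits-at-once SWAR parse:

{code:python}
def parse_eight_digits(s: bytes) -> int:
    """Parse exactly 8 ASCII digits using 64-bit SWAR arithmetic."""
    assert len(s) == 8
    M64 = (1 << 64) - 1
    val = int.from_bytes(s, "little")
    val = (val - 0x3030303030303030) & M64   # ASCII digits -> 0..9 per byte
    val = ((val * 10) + (val >> 8)) & M64    # pair adjacent digits: 10*a + b
    mask = 0x000000FF000000FF
    mul1 = 0x000F424000000064                # 100 + (1000000 << 32)
    mul2 = 0x0000271000000001                # 1 + (10000 << 32)
    val = ((((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) & M64) >> 32
    return val

print(parse_eight_digits(b"12345678"))  # -> 12345678
{code}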




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10685) [Rust] [DataFusion] Add support for join on filter pushdown optimizer

2020-11-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10685.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8738
[https://github.com/apache/arrow/pull/8738]

> [Rust] [DataFusion] Add support for join on filter pushdown optimizer
> -
>
> Key: ARROW-10685
> URL: https://issues.apache.org/jira/browse/ARROW-10685
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10669) [C++][Compute] Support Scalar inputs to boolean kernels

2020-11-23 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-10669.
--
Resolution: Fixed

Issue resolved by pull request 8728
[https://github.com/apache/arrow/pull/8728]

> [C++][Compute] Support Scalar inputs to boolean kernels
> ---
>
> Key: ARROW-10669
> URL: https://issues.apache.org/jira/browse/ARROW-10669
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently only Invert supports scalar arguments
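
A sketch of what this enables from Python, assuming the auto-generated 
{{pyarrow.compute}} wrappers dispatch scalar operands to the same kernels:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([True, False, None])
# Binary boolean kernels taking a scalar operand, as invert() already did:
print(pc.and_(arr, pa.scalar(True)))  # -> [true, false, null]
print(pc.or_(arr, pa.scalar(True)))   # -> [true, true, null] (non-Kleene)
{code}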



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10703) [Rust] [DataFusion] Make join not collect left on every part

2020-11-23 Thread Jira
Jorge Leitão created ARROW-10703:


 Summary: [Rust] [DataFusion] Make join not collect left on every 
part
 Key: ARROW-10703
 URL: https://issues.apache.org/jira/browse/ARROW-10703
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge Leitão
Assignee: Jorge Leitão






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10610) [C++] arrow-utility-test and arrow-csv-test causes failures on a big-endian platform

2020-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10610.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8747
[https://github.com/apache/arrow/pull/8747]

> [C++] arrow-utility-test and arrow-csv-test causes failures on a big-endian 
> platform
> 
>
> Key: ARROW-10610
> URL: https://issues.apache.org/jira/browse/ARROW-10610
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> After [https://github.com/apache/arrow/pull/8494] has been merged, the 
> following tests, which use {{1eN}} format, cause failures.
> {code}
> [ RUN  ] FloatingPointConversion.Basics
> /arrow/cpp/src/arrow/testing/gtest_util.cc:128: Failure
> Failed
> @@ -1, +1 @@
> --1e+30
> +0
> [  FAILED  ] FloatingPointConversion.Basics (3 ms)
> ...
> [ RUN  ] StringConversion.ToFloat
> /arrow/cpp/src/arrow/util/value_parsing_test.cc:35: Failure
> Expected equality of these values:
>   out
> Which is: 0
>   expected
> Which is: -1e+20
> Conversion failed for '-1e20'
> [  FAILED  ] StringConversion.ToFloat (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10703) [Rust] [DataFusion] Make join not collect left on every part

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10703:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Make join not collect left on every part
> 
>
> Key: ARROW-10703
> URL: https://issues.apache.org/jira/browse/ARROW-10703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10689) [Rust] [DataFusion] Support CASE WHEN from SQL

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10689:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Support CASE WHEN from SQL
> --
>
> Key: ARROW-10689
> URL: https://issues.apache.org/jira/browse/ARROW-10689
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Support CASE WHEN from SQL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10679) [Rust] [DataFusion] Implement SQL CASE WHEN physical expression

2020-11-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10679.

Resolution: Fixed

Issue resolved by pull request 8740
[https://github.com/apache/arrow/pull/8740]

> [Rust] [DataFusion] Implement SQL CASE WHEN physical expression
> ---
>
> Key: ARROW-10679
> URL: https://issues.apache.org/jira/browse/ARROW-10679
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Implement SQL CASE WHEN expression so that we can support TPC-H query 12 
> fully.
>  
> Postgres: [https://www.postgresqltutorial.com/postgresql-case/]
> Spark: 
> [http://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-case.html]
>  
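
To make the intended semantics concrete, here is a small Python sketch that
mirrors first-match-wins CASE WHEN evaluation over a column (the column name
and values below are hypothetical, not taken from TPC-H):

{code}
import numpy as np

# CASE WHEN priority = '1-URGENT' THEN 1
#      WHEN priority = '2-HIGH'   THEN 1
#      ELSE 0 END
priority = np.array(["1-URGENT", "3-MEDIUM", "2-HIGH", "5-LOW"])

# np.select evaluates the conditions in order and falls back to the default,
# which matches the first-match-wins behavior of SQL CASE WHEN.
result = np.select([priority == "1-URGENT", priority == "2-HIGH"], [1, 1],
                   default=0)
print(result)  # [1 0 1 0]
{code}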



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10704) Remove Nested from expression enum

2020-11-23 Thread Jira
Daniël Heres created ARROW-10704:


 Summary: Remove Nested from expression enum
 Key: ARROW-10704
 URL: https://issues.apache.org/jira/browse/ARROW-10704
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres


Remove Nested from expression enum. It's not needed and never produced/used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10704) [Rust][DataFusion] Remove Nested from expression enum

2020-11-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniël Heres updated ARROW-10704:
-
Summary: [Rust][DataFusion] Remove Nested from expression enum  (was: 
Remove Nested from expression enum)

> [Rust][DataFusion] Remove Nested from expression enum
> -
>
> Key: ARROW-10704
> URL: https://issues.apache.org/jira/browse/ARROW-10704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Priority: Minor
>
> Remove Nested from expression enum. It's not needed and never produced/used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10704) [Rust][DataFusion] Remove Nested from expression enum

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10704:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Remove Nested from expression enum
> -
>
> Key: ARROW-10704
> URL: https://issues.apache.org/jira/browse/ARROW-10704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Remove Nested from expression enum. It's not needed and never produced/used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10584) [Rust] [DataFusion] Implement SQL join support

2020-11-23 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-10584:
--

Assignee: Andy Grove

> [Rust] [DataFusion] Implement SQL join support
> --
>
> Key: ARROW-10584
> URL: https://issues.apache.org/jira/browse/ARROW-10584
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Update the SQL to DataFrame / LogicalPlan logic to support inner equijoins. 
> A suitable error should be returned for any unsupported join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10584) [Rust] [DataFusion] Implement SQL join support

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10584:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Implement SQL join support
> --
>
> Key: ARROW-10584
> URL: https://issues.apache.org/jira/browse/ARROW-10584
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Update the SQL to DataFrame / LogicalPlan logic to support inner equijoins. 
> A suitable error should be returned for any unsupported join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10705) [Rust] Lifetime annotations in the IPC writer are too strict, preventing code reuse

2020-11-23 Thread Carol Nichols (Jira)
Carol Nichols created ARROW-10705:
-

 Summary: [Rust] Lifetime annotations in the IPC writer are too 
strict, preventing code reuse
 Key: ARROW-10705
 URL: https://issues.apache.org/jira/browse/ARROW-10705
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Carol Nichols
Assignee: Carol Nichols


Will illustrate and explain more in the PR I'm about to open.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10667) [Rust] [Parquet] Add a convenience type for writing Parquet to memory

2020-11-23 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-10667.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8726
[https://github.com/apache/arrow/pull/8726]

> [Rust] [Parquet] Add a convenience type for writing Parquet to memory
> -
>
> Key: ARROW-10667
> URL: https://issues.apache.org/jira/browse/ARROW-10667
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Carol Nichols
>Assignee: Carol Nichols
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Similar to the `SliceableCursor` type that provides a convenience for reading 
> Parquet from memory, I would like to propose a type to make it convenient to 
> write Parquet to memory. 
> This is possible for clients to implement today, but seems common enough to 
> be worth providing for everyone to use.
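
For comparison, the Python bindings already offer this convenience through an
in-memory output stream; a rough pyarrow sketch of the same pattern (not the
proposed Rust API):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# Write the Parquet file into an in-memory buffer instead of the filesystem.
sink = pa.BufferOutputStream()
pq.write_table(table, sink)
buf = sink.getvalue()  # a pa.Buffer holding the serialized file

# Read it back directly from memory.
assert pq.read_table(pa.BufferReader(buf)).equals(table)
{code}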



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10705) [Rust] Lifetime annotations in the IPC writer are too strict, preventing code reuse

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10705:
---
Labels: pull-request-available  (was: )

> [Rust] Lifetime annotations in the IPC writer are too strict, preventing code 
> reuse
> ---
>
> Key: ARROW-10705
> URL: https://issues.apache.org/jira/browse/ARROW-10705
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Carol Nichols
>Assignee: Carol Nichols
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Will illustrate and explain more in the PR I'm about to open.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10677) [Rust] Add tests as documentation showing supported csv parsing

2020-11-23 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-10677:

Component/s: Rust

> [Rust] Add tests as documentation showing supported csv parsing
> ---
>
> Key: ARROW-10677
> URL: https://issues.apache.org/jira/browse/ARROW-10677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/pull/8714/files# / ARROW-10654 added some 
> specialized parsing for the csv reader and among other things added 
> additional boolean parsing support. 
> We should add some tests as documentation of what boolean types are supported
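
A table-driven test along these lines would serve as that documentation; a
Python sketch of the shape of such a test (the actual set of accepted
spellings is defined by the Rust CSV reader, so these cases are illustrative
only):

{code}
def parse_bool(field: str):
    """Lenient boolean parsing: case-insensitive true/false, else null."""
    lowered = field.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None  # unparseable values become null

cases = [("true", True), ("True", True), ("TRUE", True),
         ("false", False), ("FALSE", False), ("yes", None)]
for raw, expected in cases:
    assert parse_bool(raw) is expected, raw
{code}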



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10677) [Rust] Fix Bug and Add tests as documentation showing supported csv parsing

2020-11-23 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-10677:

Summary: [Rust] Fix Bug and Add tests as documentation showing supported 
csv parsing  (was: [Rust] Add tests as documentation showing supported csv 
parsing)

> [Rust] Fix Bug and Add tests as documentation showing supported csv parsing
> ---
>
> Key: ARROW-10677
> URL: https://issues.apache.org/jira/browse/ARROW-10677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/pull/8714/files# / ARROW-10654 added some 
> specialized parsing for the csv reader and among other things added 
> additional boolean parsing support. 
> We should add some tests as documentation of what boolean types are supported



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10677) [Rust] Fix Bug and Add tests as documentation showing supported csv parsing

2020-11-23 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-10677:

Description: 
ARROW-10654 added some specialized parsing for the csv reader, but there was 
no unit test coverage and it introduced a bug.

https://github.com/apache/arrow/pull/8714/files# / ARROW-10654 added some 
specialized parsing for the csv reader and among other things added additional 
boolean parsing support. 

We should add some tests as documentation of what boolean types are supported

  was:
https://github.com/apache/arrow/pull/8714/files# / ARROW-10654 added some 
specialized parsing for the csv reader and among other things added additional 
boolean parsing support. 

We should add some tests as documentation of what boolean types are supported


> [Rust] Fix Bug and Add tests as documentation showing supported csv parsing
> ---
>
> Key: ARROW-10677
> URL: https://issues.apache.org/jira/browse/ARROW-10677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ARROW-10654 added some specialized parsing for the csv reader, but there was 
> no unit test coverage and it introduced a bug.
> https://github.com/apache/arrow/pull/8714/files# / ARROW-10654 added some 
> specialized parsing for the csv reader and among other things added 
> additional boolean parsing support. 
> We should add some tests as documentation of what boolean types are supported



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10677) [Rust] Fix Bug and Add tests as documentation showing supported csv parsing

2020-11-23 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-10677.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8733
[https://github.com/apache/arrow/pull/8733]

> [Rust] Fix Bug and Add tests as documentation showing supported csv parsing
> ---
>
> Key: ARROW-10677
> URL: https://issues.apache.org/jira/browse/ARROW-10677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ARROW-10654 added some specialized parsing for the csv reader, but there was 
> no unit test coverage and it introduced a bug.
> https://github.com/apache/arrow/pull/8714/files# / ARROW-10654 added some 
> specialized parsing for the csv reader and among other things added 
> additional boolean parsing support. 
> We should add some tests as documentation of what boolean types are supported



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10574) [Python][Parquet] Enhance hive partition filtering

2020-11-23 Thread Weiyang Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyang Zhao updated ARROW-10574:
-
Description: 
I would like to enhance partition filters in methods such as:

{{pyarrow.parquet.ParquetDataset(path, filters)}}

I am proposing the below enhancements:
 # For the operators "in" and "not in", the value should be any typing.Iterable 
(that is also a container). Currently only set is supported, while other 
iterables, such as list and tuple, do not work correctly. I would like to 
change this to accept any iterable.
 # Improve the documentation about the partition filters.

I see there is a new implementation, _ParquetDatasetV2, which already accepts 
an iterable, so the documentation update applies to the new version as well.
  

  was:
I would like to enhance partition filters in methods such as:

{{pyarrow.parquet.ParquetDataset(path, filters)}}

I am proposing the below enhancements:
 # For the operators "in" and "not in", the value should be any typing.Iterable 
(that is also a container). Currently only set is supported, while other 
iterables, such as list and tuple, do not work correctly. I would like to 
change this to accept any iterable.
 # Improve the documentation about the partition filters.
 # When no partition can satisfy the filters, raise an exception with a 
meaningful error message.

I see there is a new implementation, _ParquetDatasetV2, which already accepts 
an iterable, so the documentation update applies to the new version as well.
 


> [Python][Parquet] Enhance hive partition filtering
> --
>
> Key: ARROW-10574
> URL: https://issues.apache.org/jira/browse/ARROW-10574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weiyang Zhao
>Assignee: Weiyang Zhao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I would like to enhance partition filters in methods such as:
> {{pyarrow.parquet.ParquetDataset(path, filters)}}
> I am proposing the below enhancements:
>  # For the operators "in" and "not in", the value should be any 
> typing.Iterable (that is also a container). Currently only set is supported, 
> while other iterables, such as list and tuple, do not work correctly. I would 
> like to change this to accept any iterable.
>  # Improve the documentation about the partition filters.
> I see there is a new implementation, _ParquetDatasetV2, which already accepts 
> an iterable, so the documentation update applies to the new version as well.
>   
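
With the proposed change, a call along these lines would behave the same
whether the filter value is a set, list, or tuple; a minimal sketch (the path
and column names are hypothetical):

{code}
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    "/data/events",
    filters=[("year", "in", [2019, 2020]),       # list
             ("month", "not in", ("01", "02"))]  # tuple
)
table = dataset.read()
{code}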



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10706) [Python][Parquet] when filters end up with no partition, it will throw index out of range error.

2020-11-23 Thread Weiyang Zhao (Jira)
Weiyang Zhao created ARROW-10706:


 Summary: [Python][Parquet] when filters end up with no partition, 
it will throw index out of range error.
 Key: ARROW-10706
 URL: https://issues.apache.org/jira/browse/ARROW-10706
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Weiyang Zhao
Assignee: Weiyang Zhao


The below code will raise IndexError:

{code}
dataset = pq.ParquetDataset(
    base_path, filesystem=fs,
    filters=[('string', '=', "notExisted")],
    use_legacy_dataset=True
)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10574) [Python][Parquet] Enhance hive partition filtering for 'in', 'not in'

2020-11-23 Thread Weiyang Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyang Zhao updated ARROW-10574:
-
Summary: [Python][Parquet] Enhance hive partition filtering for 'in', 'not 
in'  (was: [Python][Parquet] Enhance hive partition filtering)

> [Python][Parquet] Enhance hive partition filtering for 'in', 'not in'
> -
>
> Key: ARROW-10574
> URL: https://issues.apache.org/jira/browse/ARROW-10574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weiyang Zhao
>Assignee: Weiyang Zhao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I would like to enhance partition filters in methods such as:
> {{pyarrow.parquet.ParquetDataset(path, filters)}}
> I am proposing the below enhancements:
>  # For the operators "in" and "not in", the value should be any 
> typing.Iterable (that is also a container). Currently only set is supported, 
> while other iterables, such as list and tuple, do not work correctly. I would 
> like to change this to accept any iterable.
>  # Improve the documentation about the partition filters.
> I see there is a new implementation, _ParquetDatasetV2, which already accepts 
> an iterable, so the documentation update applies to the new version as well.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10707) [Python][Parquet] Enhance hive partition filtering with 'like' operator

2020-11-23 Thread Weiyang Zhao (Jira)
Weiyang Zhao created ARROW-10707:


 Summary: [Python][Parquet] Enhance hive partition filtering with 
'like' operator
 Key: ARROW-10707
 URL: https://issues.apache.org/jira/browse/ARROW-10707
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Weiyang Zhao
Assignee: Weiyang Zhao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10707) [Python][Parquet] Enhance hive partition filtering with 'like' operator

2020-11-23 Thread Weiyang Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyang Zhao updated ARROW-10707:
-
Description: Add a 'like' operator with the semantics of SQL LIKE. 
Alternatively, a regular expression could be used, but I prefer SQL LIKE 
semantics for consistency with SQL. 

> [Python][Parquet] Enhance hive partition filtering with 'like' operator
> ---
>
> Key: ARROW-10707
> URL: https://issues.apache.org/jira/browse/ARROW-10707
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Weiyang Zhao
>Assignee: Weiyang Zhao
>Priority: Major
>
> Add a 'like' operator with the semantics of SQL LIKE. Alternatively, a 
> regular expression could be used, but I prefer SQL LIKE semantics for 
> consistency with SQL. 
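
One possible realization is to translate a LIKE pattern into an anchored
regular expression and match it against the partition values; a minimal Python
sketch, ignoring LIKE escape sequences (like_to_regex is a hypothetical
helper, not an existing pyarrow API):

{code}
import re

def like_to_regex(pattern: str):
    """Translate a SQL LIKE pattern into an anchored regex:
    '%' matches any run of characters, '_' matches one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$")

# A filter such as ('key', 'like', '2020-%') would then keep only the
# partition directory values matching the pattern.
assert like_to_regex("2020-%").match("2020-11")
assert not like_to_regex("2020-_1").match("2020-111")
{code}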



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10706) [Python][Parquet] when filters end up with no partition, it will throw index out of range error.

2020-11-23 Thread Weiyang Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyang Zhao updated ARROW-10706:
-
Description: 
The below code will raise IndexError:

{code}
dataset = pq.ParquetDataset(
    base_path, filesystem=fs,
    filters=[('string', '=', "notExisted")],
    use_legacy_dataset=True
)
{code}

when the partition 'string' does not have a matching partition value 
'notExisted'.

  was:
The below code will raise IndexError:

{code}
dataset = pq.ParquetDataset(
    base_path, filesystem=fs,
    filters=[('string', '=', "notExisted")],
    use_legacy_dataset=True
)
{code}


> [Python][Parquet] when filters end up with no partition, it will throw index 
> out of range error.
> 
>
> Key: ARROW-10706
> URL: https://issues.apache.org/jira/browse/ARROW-10706
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Weiyang Zhao
>Assignee: Weiyang Zhao
>Priority: Major
>
> The below code will raise IndexError:
> {code}
> dataset = pq.ParquetDataset(
>     base_path, filesystem=fs,
>     filters=[('string', '=', "notExisted")],
>     use_legacy_dataset=True
> )
> {code}
> when the partition 'string' does not have a matching partition value 
> 'notExisted'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237770#comment-17237770
 ] 

Kazuaki Ishizaki commented on ARROW-10699:
--

[~apitrou] Thank you for fixing this. I missed this.

> [C++] BitmapUInt64Reader doesn't work on big-endian
> ---
>
> Key: ARROW-10699
> URL: https://issues.apache.org/jira/browse/ARROW-10699
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to 
> fail).
> https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10708) [Packaging][deb] Add support for Ubuntu 20.10

2020-11-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10708:


 Summary: [Packaging][deb] Add support for Ubuntu 20.10
 Key: ARROW-10708
 URL: https://issues.apache.org/jira/browse/ARROW-10708
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10708) [Packaging][deb] Add support for Ubuntu 20.10

2020-11-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10708:
---
Labels: pull-request-available  (was: )

> [Packaging][deb] Add support for Ubuntu 20.10
> -
>
> Key: ARROW-10708
> URL: https://issues.apache.org/jira/browse/ARROW-10708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10665) [Rust] Add fast paths for common utf8 like patterns

2020-11-23 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10665.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8723
[https://github.com/apache/arrow/pull/8723]

> [Rust] Add fast paths for common utf8 like patterns
> ---
>
> Key: ARROW-10665
> URL: https://issues.apache.org/jira/browse/ARROW-10665
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Daniël Heres
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Patterns such as 'xxx%', '%xxx', and '%xxx%' can use faster methods from the 
> Rust standard library (starts_with, ends_with, contains) instead of a regex.
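
The dispatch idea, sketched here in Python for brevity (the actual
implementation is in Rust and uses the standard library's starts_with,
ends_with, and contains):

{code}
import re

def like_match(value: str, pattern: str) -> bool:
    body = pattern.strip("%")
    # Fast paths for the common pattern shapes, avoiding a regex entirely.
    if "%" not in body and "_" not in pattern:
        if pattern.startswith("%") and pattern.endswith("%"):
            return body in value           # '%xxx%' -> contains
        if pattern.endswith("%"):
            return value.startswith(body)  # 'xxx%'  -> starts_with
        if pattern.startswith("%"):
            return value.endswith(body)    # '%xxx'  -> ends_with
        return value == pattern            # no wildcard -> equality
    # General case: fall back to a (simplified) regex translation.
    regex = "^" + re.escape(pattern).replace("%", ".*").replace("_", ".") + "$"
    return re.match(regex, value) is not None

assert like_match("hello world", "%world")
assert like_match("hello world", "hello%")
assert like_match("hello world", "%lo wo%")
{code}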



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10708) [Packaging][deb] Add support for Ubuntu 20.10

2020-11-23 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10708.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8753
[https://github.com/apache/arrow/pull/8753]

> [Packaging][deb] Add support for Ubuntu 20.10
> -
>
> Key: ARROW-10708
> URL: https://issues.apache.org/jira/browse/ARROW-10708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9694) [Ruby] can't install red-arrow-gsl

2020-11-23 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-9694.
---
Resolution: Not A Problem

> [Ruby] can't install red-arrow-gsl
> --
>
> Key: ARROW-9694
> URL: https://issues.apache.org/jira/browse/ARROW-9694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 1.0.0
> Environment: windows, msys2, 
>Reporter: Dominic Sisneros
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: gem_make.out
>
>
> {noformat}
> f:\programming\source\repos\ruby\try_arrow>gem install red-arrow-gsl
> Temporarily enhancing PATH for MSYS/MINGW...
> Building native extensions. This could take a while...
> ERROR:  Error installing red-arrow-gsl:
> ERROR: Failed to build gem native extension.
> current directory: 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3/ext/gsl_native
> F:/windows/scoop/apps/ruby/2.7.1-1/bin/ruby.exe -I 
> F:/windows/scoop/apps/ruby/2.7.1-1/lib/ruby/site_ruby/2.7.0 -r 
> ./siteconf20200811-28480-149f31i.rb extconf.rb
> sh: gsl-config: No such file or directory
> *** ERROR: missing required library to compile this module: undefined method 
> `chomp' for nil:NilClass
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
> Provided configuration options:
> --with-opt-dir
> --without-opt-dir
> --with-opt-include
> --without-opt-include=${opt-dir}/include
> --with-opt-lib
> --without-opt-lib=${opt-dir}/lib
> --with-make-prog
> --without-make-prog
> --srcdir=.
> --curdir
> --ruby=F:/windows/scoop/apps/ruby/2.7.1-1/bin/$(RUBY_BASE_NAME)
> --with-gsl-version
> extconf failed, exit code 1
> Gem files will remain installed in 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3 for inspection.
> Results logged to 
> F:/windows/scoop/persist/ruby/gems/extensions/x64-mingw32/2.7.0/gsl-2.1.0.3/gem_make.out
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9694) [Ruby] can't install red-arrow-gsl

2020-11-23 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-9694:

Environment: windows, msys2  (was: windows, msys2, )

> [Ruby] can't install red-arrow-gsl
> --
>
> Key: ARROW-9694
> URL: https://issues.apache.org/jira/browse/ARROW-9694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 1.0.0
> Environment: windows, msys2
>Reporter: Dominic Sisneros
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: gem_make.out
>
>
> {noformat}
> f:\programming\source\repos\ruby\try_arrow>gem install red-arrow-gsl
> Temporarily enhancing PATH for MSYS/MINGW...
> Building native extensions. This could take a while...
> ERROR:  Error installing red-arrow-gsl:
> ERROR: Failed to build gem native extension.
> current directory: 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3/ext/gsl_native
> F:/windows/scoop/apps/ruby/2.7.1-1/bin/ruby.exe -I 
> F:/windows/scoop/apps/ruby/2.7.1-1/lib/ruby/site_ruby/2.7.0 -r 
> ./siteconf20200811-28480-149f31i.rb extconf.rb
> sh: gsl-config: No such file or directory
> *** ERROR: missing required library to compile this module: undefined method 
> `chomp' for nil:NilClass
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
> Provided configuration options:
> --with-opt-dir
> --without-opt-dir
> --with-opt-include
> --without-opt-include=${opt-dir}/include
> --with-opt-lib
> --without-opt-lib=${opt-dir}/lib
> --with-make-prog
> --without-make-prog
> --srcdir=.
> --curdir
> --ruby=F:/windows/scoop/apps/ruby/2.7.1-1/bin/$(RUBY_BASE_NAME)
> --with-gsl-version
> extconf failed, exit code 1
> Gem files will remain installed in 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3 for inspection.
> Results logged to 
> F:/windows/scoop/persist/ruby/gems/extensions/x64-mingw32/2.7.0/gsl-2.1.0.3/gem_make.out
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-9694) [Ruby] can't install red-arrow-gsl

2020-11-23 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reopened ARROW-9694:
-

Sorry. This isn't resolved yet.

> [Ruby] can't install red-arrow-gsl
> --
>
> Key: ARROW-9694
> URL: https://issues.apache.org/jira/browse/ARROW-9694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 1.0.0
> Environment: windows, msys2, 
>Reporter: Dominic Sisneros
>Assignee: Kouhei Sutou
>Priority: Major
> Attachments: gem_make.out
>
>
> {noformat}
> f:\programming\source\repos\ruby\try_arrow>gem install red-arrow-gsl
> Temporarily enhancing PATH for MSYS/MINGW...
> Building native extensions. This could take a while...
> ERROR:  Error installing red-arrow-gsl:
> ERROR: Failed to build gem native extension.
> current directory: 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3/ext/gsl_native
> F:/windows/scoop/apps/ruby/2.7.1-1/bin/ruby.exe -I 
> F:/windows/scoop/apps/ruby/2.7.1-1/lib/ruby/site_ruby/2.7.0 -r 
> ./siteconf20200811-28480-149f31i.rb extconf.rb
> sh: gsl-config: No such file or directory
> *** ERROR: missing required library to compile this module: undefined method 
> `chomp' for nil:NilClass
> *** extconf.rb failed ***
> Could not create Makefile due to some reason, probably lack of necessary
> libraries and/or headers.  Check the mkmf.log file for more details.  You may
> need configuration options.
> Provided configuration options:
> --with-opt-dir
> --without-opt-dir
> --with-opt-include
> --without-opt-include=${opt-dir}/include
> --with-opt-lib
> --without-opt-lib=${opt-dir}/lib
> --with-make-prog
> --without-make-prog
> --srcdir=.
> --curdir
> --ruby=F:/windows/scoop/apps/ruby/2.7.1-1/bin/$(RUBY_BASE_NAME)
> --with-gsl-version
> extconf failed, exit code 1
> Gem files will remain installed in 
> F:/windows/scoop/persist/ruby/gems/gems/gsl-2.1.0.3 for inspection.
> Results logged to 
> F:/windows/scoop/persist/ruby/gems/extensions/x64-mingw32/2.7.0/gsl-2.1.0.3/gem_make.out
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)