[jira] [Created] (ARROW-14937) [Doc] The docs about building docs don't mention mounting the target directory

2021-12-01 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-14937:
-

 Summary: [Doc] The docs about building docs don't mention 
mounting the target directory
 Key: ARROW-14937
 URL: https://issues.apache.org/jira/browse/ARROW-14937
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Affects Versions: 6.0.1
Reporter: Alessandro Molina


When building with docker ( 
[https://arrow.apache.org/docs/developers/documentation.html#building-with-docker]
 ) the output goes into the {{/build}} directory of the container. The 
documentation states that the output will be available in {{docs/_build/html}}, 
but that only holds if you mount that local directory as {{/build}}. Otherwise 
the output stays inside the container and is not accessible from your own 
system.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14938) Partition column disappears when reading dataset

2021-12-01 Thread Martin Gran (Jira)
Martin Gran created ARROW-14938:
---

 Summary: Partition column disappears when reading dataset
 Key: ARROW-14938
 URL: https://issues.apache.org/jira/browse/ARROW-14938
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.1
 Environment: Debian bullseye, python 3.9
Reporter: Martin Gran


Appending CSV to parquet dataset with partitioning on "code".
{code:python}
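import pyarrow as pa
import pyarrow.dataset as ds

# chunk (a pandas DataFrame), y and output_path come from the surrounding code
table = pa.Table.from_pandas(chunk)
ds.write_dataset(
    table,
    output_path,
    basename_template=f"chunk_{y}_{{i}}",
    format="parquet",
    partitioning=["code"],
    existing_data_behavior="overwrite_or_ignore",
)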
{code}
Loading the dataset again and expecting code to be in the dataframe.
{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset("../data/interim/2020_elements_parquet/", format="parquet")
df = dataset.to_table().to_pandas()

df["code"]  # raises KeyError: 'code'
{code}
Trace
{code:python}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'code'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_24875/4149106129.py in <module>
----> 1 df["code"]

~/.local/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'code'
{code}
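
One possible explanation, not confirmed in this report: when {{partitioning}} is given as a plain list of field names, the written paths may carry only the partition value (for example {{.../A1/chunk_0.parquet}}) rather than a Hive-style {{code=A1}} segment, so a reader that is not told the field name has nothing to rebuild the column from. The stdlib-only sketch below illustrates that idea. The helper {{partition_values}} and the example paths are hypothetical and are not pyarrow's implementation.

```python
from pathlib import PurePosixPath


def partition_values(path, base, field_names=None):
    """Recover partition columns from the directories between base and the file.

    Hypothetical helper for illustration only.
    """
    # Directory segments between the dataset root and the file name.
    segments = PurePosixPath(path).relative_to(base).parts[:-1]
    values = {}
    for position, segment in enumerate(segments):
        if "=" in segment:
            # Hive-style segment: the key is stored in the path itself.
            key, _, value = segment.partition("=")
            values[key] = value
        elif field_names is not None and position < len(field_names):
            # Directory-style segment: the key must be supplied by the reader.
            values[field_names[position]] = segment
    return values


hive_path = "data/code=A1/chunk_0.parquet"
dir_path = "data/A1/chunk_0.parquet"

hive_result = partition_values(hive_path, "data")
dir_result_blind = partition_values(dir_path, "data")
dir_result_named = partition_values(dir_path, "data", ["code"])
```

If this is what happens here, passing the same partitioning description when reading the dataset would be the thing to try first.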





[jira] [Created] (ARROW-14939) [R] Problem with new variables in dataset schema

2021-12-01 Thread Pal (Jira)
Pal created ARROW-14939:
---

 Summary: [R] Problem with new variables in dataset schema
 Key: ARROW-14939
 URL: https://issues.apache.org/jira/browse/ARROW-14939
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 6.0.1
 Environment: 
RStudio Version
--
1.4.1717


Session Information
--
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.0.1

Matrix products: default
LAPACK: 
/Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] arrow_6.0.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1 bit_4.0.4        compiler_4.1.0   magrittr_2.0.1   assertthat_0.2.1 R6_2.5.1
 [7] tools_4.1.0      glue_1.5.0       bit64_4.0.5      vctrs_0.3.8      rlang_0.4.12     purrr_0.3.4


System Information
--
sysname: Darwin
release: 21.1.0
version: Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
nodename   :
machine: x86_64
login  : root
user   : os
effective_user : os



Platform Information
--
OS.type: unix
file.sep   : /
dynlib.ext : .so
GUI: RStudio
endian : little
pkgType: mac.binary
path.sep   : :
r_arch : 
Reporter: Pal


Hi, 

I have a problem with the schema detected by arrow::open_dataset().

For example, say I have one Parquet file with two columns (a and b) and 
another file with three columns (a, b and c). When I open this dataset, its 
schema only contains columns a and b. Am I missing something? In the past I 
have added new columns to some Parquet files that did not exist in other files, 
and the new columns were automatically added to the schema, which was great.

The code below reproduces the issue:

 
{code:r}
library(arrow)

df <- data.frame(a = 1, b = 2)
df_2 <- data.frame(a = 2, b = 3, c = 4)
write_parquet(df, "C:/Data/test2/df1.parquet")
write_parquet(df_2, "C:/Data/test2/df2.parquet")

# Open both files as one dataset and list the columns in its schema
ds <- arrow::open_dataset(sources = "C:/Data/test2")
ds_cols <- data.frame(variables = ds$schema$names)
ds
{code}
 





[jira] [Created] (ARROW-14940) [C++] Speed up CSV parser with long CSV cells

2021-12-01 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-14940:
--

 Summary: [C++] Speed up CSV parser with long CSV cells
 Key: ARROW-14940
 URL: https://issues.apache.org/jira/browse/ARROW-14940
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Some CSV files have long cells (values), for example when they contain 
arbitrary text or even things like timestamps. We can speed up parsing such 
CSV files by filtering multiple bytes at once for state-changing characters 
such as delimiters.
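
As a rough illustration of the idea (not the Arrow C++ implementation, which is not described in this issue), the pure-Python sketch below checks a whole block of bytes for state-changing characters and skips it in one step when none are present. The function name {{split_cells}}, the block size of 8 and the choice of special bytes are assumptions, and quoting is deliberately ignored.

```python
# Bytes that change parser state in this simplified sketch: delimiter and newline.
SPECIAL = frozenset(b",\n")


def split_cells(data, block=8):
    """Split bytes into cells, skipping blocks that contain no special byte."""
    cells = []
    start = 0  # start offset of the cell currently being read
    pos = 0
    n = len(data)
    while pos < n:
        chunk = data[pos:pos + block]
        if SPECIAL.isdisjoint(chunk):
            # Fast path: the whole block is plain cell content.
            pos += len(chunk)
            continue
        # Slow path: inspect the block byte by byte.
        for offset, byte in enumerate(chunk):
            if byte in SPECIAL:
                cells.append(data[start:pos + offset])
                start = pos + offset + 1
        pos += len(chunk)
    if start < n:
        # Trailing cell without a terminating delimiter or newline.
        cells.append(data[start:])
    return cells


row = b"a long free-text cell value,2021-12-01T00:00:00,x\n"
cells = split_cells(row)
```

The longer the cells, the larger the share of blocks that take the fast path, which is where the expected gain would come from.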





[jira] [Created] (ARROW-14941) [R] Implement Duration R6 class

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14941:


 Summary: [R] Implement Duration R6 class
 Key: ARROW-14941
 URL: https://issues.apache.org/jira/browse/ARROW-14941
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14942) [R] Bindings for lubridate's dpicoseconds, dnanoseconds, dseconds, dmilliseconds, dmicroseconds

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14942:


 Summary: [R] Bindings for lubridate's dpicoseconds, dnanoseconds, 
dseconds, dmilliseconds, dmicroseconds
 Key: ARROW-14942
 URL: https://issues.apache.org/jira/browse/ARROW-14942
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14943) [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14943:


 Summary: [R] Bindings for lubridate's ddays, dhours, dminutes, 
dmonths, dweeks, dyears
 Key: ARROW-14943
 URL: https://issues.apache.org/jira/browse/ARROW-14943
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14944) [R] Implement lubridate::make_difftime

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14944:


 Summary: [R] Implement lubridate::make_difftime
 Key: ARROW-14944
 URL: https://issues.apache.org/jira/browse/ARROW-14944
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14945) [R] Implement lubridate functions for doing maths with dates

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14945:


 Summary: [R] Implement lubridate functions for doing maths with dates
 Key: ARROW-14945
 URL: https://issues.apache.org/jira/browse/ARROW-14945
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14946) [C++][Python] An operator for finding indices of a value

2021-12-01 Thread Niranda Perera (Jira)
Niranda Perera created ARROW-14946:
--

 Summary: [C++][Python] An operator for finding indices of a value 
 Key: ARROW-14946
 URL: https://issues.apache.org/jira/browse/ARROW-14946
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Niranda Perera


As discussed in this mail thread [1], it would be nice to have a search 
operator that returns the indices of a value. 

Example:
{code:python}
values = pa.array([1, 2, 2, 3, 4, 1])
indices = find_indices(values, 1)  # expected = [0, 5]
{code}
Currently there is an option to get the "first index" of a value using the 
aggregates.index method. This would be a binary vector kernel IMO. 


This is somewhat similar to `numpy.where` [2] but without a `y` input. 
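
To pin down the intended semantics, here is a minimal reference sketch on a plain Python list. {{find_indices}} is the proposed, not yet existing, operator name from the example in this issue, and the list-based version only shows the expected result, not how a kernel would be implemented.

```python
def find_indices(values, target):
    """Return the positions of every element equal to target."""
    return [i for i, value in enumerate(values) if value == target]


values = [1, 2, 2, 3, 4, 1]
indices = find_indices(values, 1)  # [0, 5]
missing = find_indices(values, 7)  # [] when the value does not occur
```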

 

[1] [https://lists.apache.org/thread/o8d4m905fxswcg0qjjx7gj3ql2d582k4]


[2] https://numpy.org/doc/stable/reference/generated/numpy.where.html





[jira] [Created] (ARROW-14947) [C++] Implement maths with timestamps?

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14947:


 Summary: [C++] Implement maths with timestamps?
 Key: ARROW-14947
 URL: https://issues.apache.org/jira/browse/ARROW-14947
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Nicola Crane


Please could we have maths with timestamps implemented?

In order to implement some of the functionality I'd like in R, I need to be 
able to do maths with dates.  For example:

* Addition and subtraction: Timestamp + Duration = Timestamp (with and without 
rollback, so that e.g. 2021-03-30 minus 1 month can return either a null or 
2021-02-28), plus the ability to specify whether to roll back to the first or 
last day, and whether to preserve or reset the time.

See https://lubridate.tidyverse.org/reference/mplus.html for documentation of 
the R functionality.

* Multiplying Durations: Duration * Numeric = Duration
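
The rollback behaviour requested in the first bullet can be sketched with the Python standard library. The function name {{add_months}} and its {{rollback}} flag are assumptions made for this illustration. It only covers rolling back to the last valid day of the month and works on dates, not full timestamps.

```python
import calendar
from datetime import date


def add_months(d, months, rollback=True):
    """Shift a date by whole months.

    If the target month has no such day, return the last day of that month
    when rollback is True, otherwise None (standing in for a null).
    """
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    if d.day <= last_day:
        return date(year, month, d.day)
    return date(year, month, last_day) if rollback else None


rolled = add_months(date(2021, 3, 30), -1)                  # date(2021, 2, 28)
nulled = add_months(date(2021, 3, 30), -1, rollback=False)  # None
```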





[jira] [Created] (ARROW-14948) [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14948:


 Summary: [R] Implement lubridate's %m+%, %m-%, add_with_rollback, 
and addition and subtraction with timestamp
 Key: ARROW-14948
 URL: https://issues.apache.org/jira/browse/ARROW-14948
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14949) [R] Implement lubridate interval functions

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14949:


 Summary: [R] Implement lubridate interval functions
 Key: ARROW-14949
 URL: https://issues.apache.org/jira/browse/ARROW-14949
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14950) [R] Implement lubridate's interval type and lubridate::interval()

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14950:


 Summary: [R] Implement lubridate's interval type and 
lubridate::interval()
 Key: ARROW-14950
 URL: https://issues.apache.org/jira/browse/ARROW-14950
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


In lubridate, an interval is a pair of dates.  We could implement this as an 
extension type.
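
As a language-neutral picture of "a pair of dates", the sketch below models an interval as a start and an end and adds a few operations similar in spirit to lubridate's int_length, int_flip and %within%. The class and method names are assumptions for illustration, and nothing here says how an Arrow extension type would store the pair.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Interval:
    """A pair of timestamps; start may be later than end."""
    start: datetime
    end: datetime

    def length_seconds(self):
        # Signed length: negative when end precedes start.
        return (self.end - self.start).total_seconds()

    def flip(self):
        # Swap the two endpoints.
        return Interval(self.end, self.start)

    def contains(self, moment):
        # Membership test that ignores the direction of the interval.
        low, high = sorted((self.start, self.end))
        return low <= moment <= high


day = Interval(datetime(2021, 12, 1), datetime(2021, 12, 2))
length = day.length_seconds()
flipped = day.flip()
inside = day.contains(datetime(2021, 12, 1, 12))
```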





[jira] [Created] (ARROW-14951) [R] Implement lubridate %within%

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14951:


 Summary: [R] Implement lubridate %within%
 Key: ARROW-14951
 URL: https://issues.apache.org/jira/browse/ARROW-14951
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14952) [R] Implement lubridate int_start, int_end, int_length

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14952:


 Summary: [R] Implement lubridate int_start, int_end, int_length
 Key: ARROW-14952
 URL: https://issues.apache.org/jira/browse/ARROW-14952
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14953) [R] Implement lubridate int_flip and int_shift

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14953:


 Summary: [R] Implement lubridate int_flip and int_shift
 Key: ARROW-14953
 URL: https://issues.apache.org/jira/browse/ARROW-14953
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14954) [R] Implement lubridate's int_aligns and int_overlaps

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14954:


 Summary: [R] Implement lubridate's int_aligns and int_overlaps
 Key: ARROW-14954
 URL: https://issues.apache.org/jira/browse/ARROW-14954
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14955) [R] Implement lubridate's int_diff

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14955:


 Summary: [R] Implement lubridate's int_diff
 Key: ARROW-14955
 URL: https://issues.apache.org/jira/browse/ARROW-14955
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14956) [R] Implement lubridate's int_standardize

2021-12-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14956:


 Summary: [R] Implement lubridate's int_standardize
 Key: ARROW-14956
 URL: https://issues.apache.org/jira/browse/ARROW-14956
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-14957) [C++] Update OpenTelemetry to v1.1.0

2021-12-01 Thread David Li (Jira)
David Li created ARROW-14957:


 Summary: [C++] Update OpenTelemetry to v1.1.0
 Key: ARROW-14957
 URL: https://issues.apache.org/jira/browse/ARROW-14957
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li
Assignee: David Li


Changelog is fairly minor: 
[https://github.com/open-telemetry/opentelemetry-cpp/releases/tag/v1.1.0]





[jira] [Created] (ARROW-14958) [C++] Enable OpenTelemetry with Arrow Flight

2021-12-01 Thread David Li (Jira)
David Li created ARROW-14958:


 Summary: [C++] Enable OpenTelemetry with Arrow Flight
 Key: ARROW-14958
 URL: https://issues.apache.org/jira/browse/ARROW-14958
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: David Li


Sans Python support, at least for now, since figuring out how to do the 
bindings will be a challenge there. Also see 
[https://github.com/open-telemetry/community/discussions/734]





[jira] [Created] (ARROW-14959) Reading Hive-style partitioned parquet files from GCS

2021-12-01 Thread Garrett Weaver (Jira)
Garrett Weaver created ARROW-14959:
--

 Summary: Reading Hive-style partitioned parquet files from GCS
 Key: ARROW-14959
 URL: https://issues.apache.org/jira/browse/ARROW-14959
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 6.0.1
Reporter: Garrett Weaver


I am trying to read a Spark-generated Hive-style partitioned Parquet dataset 
with `gcsfs` and `pyarrow`, but I get a `FileNotFoundError` whether I read from 
the base directory or directly from one of the partitions. I am not sure 
whether I am doing something wrong or this is not supported.

Note that I have successfully read this hive-style partitioned parquet dataset 
using other methods to rule out any other issues, including:
 * Successful read with pyspark by using spark.read.parquet
 * Successful read of a specific partition by passing list of paths to 
`ParquetDataset`
 * Also, tested reading another spark-generated parquet dataset with no 
Hive-style partitions from GCS and that worked as well.

Below is what I am trying:

```
import gcsfs
import pyarrow as pa
import pyarrow.dataset as ds

gcs = gcsfs.GCSFileSystem()

# Hive-style partitioning on a single string column
partitions = ds.partitioning(
    pa.schema([("partition_var", pa.string())]),
    dictionaries="infer",
    flavor="hive",
)

# Attempt 1: read from the dataset's base directory
partitioned_dataset = ds.dataset(
    "path/to/partitioned/dataset/base/dir",
    filesystem=gcs,
    format="parquet",
    partitioning=partitions,
)

# Attempt 2: read a single partition directly
partition_of_dataset = ds.dataset(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value",
    filesystem=gcs,
    format="parquet",
)
```

 

The errors returned for both are below:

```
FileNotFoundError: path/to/partitioned/dataset/base/dir/

FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
```

 





[jira] [Created] (ARROW-14960) [C++] Google style guide allows mutable references now, what do?

2021-12-01 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-14960:


 Summary: [C++] Google style guide allows mutable references now, 
what do?
 Key: ARROW-14960
 URL: https://issues.apache.org/jira/browse/ARROW-14960
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Ben Kietzman


As of 
https://github.com/google/styleguide/commit/7a7a2f510efe7d7fc5ea8fbed549ddb31fac8f3e
 the Google Style Guide no longer forbids use of mutable references for output 
arguments, and actually encourages using them when the output argument is not 
optional.

This puts the Arrow C++ style out of sync, since we have continued to enforce 
the use of pointers for output arguments. We could:

- keep the ban and note this as a deviation from google style in 
[development.rst|https://github.com/bkietz/arrow/blob/392af8aa999f940ab8fd61684820b2c6d89f7871/docs/source/developers/cpp/development.rst#L74-L75]
- open JIRA(s) for updating pointer-output APIs where applicable





[jira] [Created] (ARROW-14961) Bump version on Google Benchmark

2021-12-01 Thread Sasha Krassovsky (Jira)
Sasha Krassovsky created ARROW-14961:


 Summary: Bump version on Google Benchmark 
 Key: ARROW-14961
 URL: https://issues.apache.org/jira/browse/ARROW-14961
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking
Reporter: Sasha Krassovsky
Assignee: Sasha Krassovsky


Google Benchmark v1.6.0 came out - I'd like to use a couple of functions it 
provides in a different issue.





[jira] [Created] (ARROW-14962) [CI] Fix minio installation on s390x

2021-12-01 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-14962:
---

 Summary: [CI] Fix minio installation on s390x
 Key: ARROW-14962
 URL: https://issues.apache.org/jira/browse/ARROW-14962
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Krisztian Szucs
 Fix For: 7.0.0





