[jira] [Created] (ARROW-14937) [Doc] The docs about building docs don't mention mounting the target directory
Alessandro Molina created ARROW-14937:

Summary: [Doc] The docs about building docs don't mention mounting the target directory
Key: ARROW-14937
URL: https://issues.apache.org/jira/browse/ARROW-14937
Project: Apache Arrow
Issue Type: Bug
Components: Documentation
Affects Versions: 6.0.1
Reporter: Alessandro Molina

When building with docker ([https://arrow.apache.org/docs/developers/documentation.html#building-with-docker]), the output goes into the {{/build}} directory of the container. The documentation states that the output will be available in {{docs/_build/html}}, but for that to be true you need to mount that local directory as {{/build}} so that the output is available on your own system; otherwise it remains in the container and is not accessible.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14938) Partition column disappears when reading dataset
Martin Gran created ARROW-14938:

Summary: Partition column disappears when reading dataset
Key: ARROW-14938
URL: https://issues.apache.org/jira/browse/ARROW-14938
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 6.0.1
Environment: Debian bullseye, python 3.9
Reporter: Martin Gran

Appending CSV to a parquet dataset with partitioning on "code".

{code:python}
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

# chunk, output_path and y come from the reporter's surrounding script (not shown)
table = pa.Table.from_pandas(chunk)
pa.dataset.write_dataset(
    table,
    output_path,
    basename_template=f"chunk_{y}_{{i}}",
    format="parquet",
    partitioning=["code"],
    existing_data_behavior="overwrite_or_ignore",
)
{code}

Loading the dataset again and expecting "code" to be in the dataframe.

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("../data/interim/2020_elements_parquet/", format="parquet")
df = dataset.to_table().to_pandas()
df["code"]
{code}

Trace

{code:python}
---
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'code'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_24875/4149106129.py in <module>
----> 1 df["code"]

~/.local/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'code'
{code}
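One possible reading of the report above, offered as an assumption rather than a confirmed diagnosis: if a bare list of field names produces directories named `<value>/` instead of `code=<value>/`, then a reader that is not told the field names has nothing in the paths from which to recover `code`. The pure-Python sketch below only illustrates that difference. `parse_partition_keys` and the sample paths are hypothetical and are not part of pyarrow.

```python
from typing import Dict, List, Optional


def parse_partition_keys(
    relative_path: str, field_names: Optional[List[str]] = None
) -> Dict[str, str]:
    """Recover partition columns from a file path relative to the dataset root.

    Hypothetical helper for illustration only; not a pyarrow API.
    """
    # Every path segment except the last one (the file name) is a directory.
    directories = relative_path.split("/")[:-1]

    if field_names is None:
        # Hive-style discovery: only "key=value" directories carry a column name.
        keys = {}
        for segment in directories:
            if "=" in segment:
                name, value = segment.split("=", 1)
                keys[name] = value
        return keys

    # Directory-style layout: names come from the caller, values from the path.
    return dict(zip(field_names, directories))


# Hive-style path: the column name is embedded in the directory name.
hive_keys = parse_partition_keys("code=AB/chunk_0.parquet")

# Directory-style path without field names: nothing identifies "AB" as `code`.
bare_keys = parse_partition_keys("AB/chunk_0.parquet")

# Directory-style path with the field names supplied by the reader.
named_keys = parse_partition_keys("AB/chunk_0.parquet", field_names=["code"])
```

If this assumption holds for the reporter's data, passing the same `partitioning` to `ds.dataset(...)` when reading, or writing with a Hive-style layout, would be the things to try.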
[jira] [Created] (ARROW-14939) [R] Problem with new variables in dataset schema
Pal created ARROW-14939:

Summary: [R] Problem with new variables in dataset schema
Key: ARROW-14939
URL: https://issues.apache.org/jira/browse/ARROW-14939
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 6.0.1
Environment:
RStudio Version -- 1.4.1717
Session Information --
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.0.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_6.0.1
loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 bit_4.0.4 compiler_4.1.0 magrittr_2.0.1 assertthat_0.2.1 R6_2.5.1
[7] tools_4.1.0 glue_1.5.0 bit64_4.0.5 vctrs_0.3.8 rlang_0.4.12 purrr_0.3.4
System Information --
sysname: Darwin
release: 21.1.0
version: Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
nodename :
machine: x86_64
login : root
user : os
effective_user : os
Platform Information --
OS.type: unix
file.sep : /
dynlib.ext : .so
GUI: RStudio
endian : little
pkgType: mac.binary
path.sep : :
r_arch :
Reporter: Pal

Hi, I have a problem with updating the schema in arrow::open_dataset(). For example, let's say I have one parquet file with two columns (a and b) and another file with three columns (a, b and c). When I open this dataset, its schema only contains columns a and b. Am I missing something? In my previous experience, I had already added new columns to some Parquet files that did not exist in other files, and the new columns were automatically added to my schema, which was great.

Below is the code to replicate my issue:

{code:java}
library(arrow)

df = data.frame(a = 1, b = 2)
df_2 = data.frame(a = 2, b = 3, c = 4)
write_parquet(df, "C:/Data/test2/df1.parquet")
write_parquet(df_2, "C:/Data/test2/df2.parquet")
ds <- arrow::open_dataset(sources = "C:/Data/test2")
ds_cols <- data.frame(variables = ds$schema$names)
ds
{code}
[jira] [Created] (ARROW-14940) [C++] Speed up CSV parser with long CSV cells
Antoine Pitrou created ARROW-14940:

Summary: [C++] Speed up CSV parser with long CSV cells
Key: ARROW-14940
URL: https://issues.apache.org/jira/browse/ARROW-14940
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

Some CSV files may have long cells (values), for example if they contain arbitrary text or even things like timestamps. We can speed up parsing such CSV files by filtering multiple bytes at once for state-changing characters such as delimiters.
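To make the idea of checking several bytes at once concrete, here is a minimal sketch in Python. It is not Arrow's implementation, and the actual C++ change may well use a different technique. It packs 8 bytes into one integer and uses the classic "has zero byte" bit trick to test all of them in one step; the set of special bytes (`SPECIAL`) and the function names are assumptions made for this example.

```python
ONES = 0x0101010101010101   # 0x01 repeated in each of the 8 bytes
HIGHS = 0x8080808080808080  # 0x80 repeated in each of the 8 bytes
SPECIAL = b',"\r\n'          # assumed state-changing bytes for this sketch


def word_has_byte(word: int, byte: int) -> bool:
    """Return True if any of the 8 bytes packed in `word` equals `byte`."""
    # XOR turns every matching byte into 0x00; the bit trick below is
    # nonzero exactly when at least one byte of x is zero.
    x = word ^ (byte * ONES)
    return ((x - ONES) & ~x & HIGHS) != 0


def find_first_special(data: bytes) -> int:
    """Return the index of the first special byte in `data`, or len(data)."""
    pos = 0
    n = len(data)
    # Fast path: skip whole 8-byte blocks that contain no special byte.
    while pos + 8 <= n:
        word = int.from_bytes(data[pos:pos + 8], "little")
        if any(word_has_byte(word, b) for b in SPECIAL):
            break
        pos += 8
    # Slow path: locate the exact byte inside the block (or in the tail).
    while pos < n and data[pos] not in SPECIAL:
        pos += 1
    return pos
```

With long cells, most blocks take the fast path, so the per-byte state machine only runs near delimiters, quotes and line ends.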
[jira] [Created] (ARROW-14941) [R] Implement Duration R6 class
Nicola Crane created ARROW-14941:

Summary: [R] Implement Duration R6 class
Key: ARROW-14941
URL: https://issues.apache.org/jira/browse/ARROW-14941
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14942) [R] Bindings for lubridate's dpicoseconds, dnanoseconds, dseconds, dmilliseconds, dmicroseconds
Nicola Crane created ARROW-14942:

Summary: [R] Bindings for lubridate's dpicoseconds, dnanoseconds, dseconds, dmilliseconds, dmicroseconds
Key: ARROW-14942
URL: https://issues.apache.org/jira/browse/ARROW-14942
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14943) [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears
Nicola Crane created ARROW-14943:

Summary: [R] Bindings for lubridate's ddays, dhours, dminutes, dmonths, dweeks, dyears
Key: ARROW-14943
URL: https://issues.apache.org/jira/browse/ARROW-14943
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14944) [R] Implement lubridate::make_difftime
Nicola Crane created ARROW-14944:

Summary: [R] Implement lubridate::make_difftime
Key: ARROW-14944
URL: https://issues.apache.org/jira/browse/ARROW-14944
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14945) [R] Implement lubridate functions for doing maths with dates
Nicola Crane created ARROW-14945:

Summary: [R] Implement lubridate functions for doing maths with dates
Key: ARROW-14945
URL: https://issues.apache.org/jira/browse/ARROW-14945
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14946) [C++][Python] An operator for finding indices of a value
Niranda Perera created ARROW-14946:

Summary: [C++][Python] An operator for finding indices of a value
Key: ARROW-14946
URL: https://issues.apache.org/jira/browse/ARROW-14946
Project: Apache Arrow
Issue Type: New Feature
Components: C++, Python
Reporter: Niranda Perera

As discussed in this mail thread [1], it would be nice to have a search operator returning the indices of a value. For example:

{code:python}
values = pa.array([1, 2, 2, 3, 4, 1])
indices = find_indices(values, 1)  # expected = [0, 5]
{code}

Currently there is an option to get the "first index" of a value using the aggregates.index method. This would be a binary vector kernel IMO. This is somewhat similar to `numpy.where` [2] but without a `y` input.

[1] [https://lists.apache.org/thread/o8d4m905fxswcg0qjjx7gj3ql2d582k4]
[2] https://numpy.org/doc/stable/reference/generated/numpy.where.html
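The requested behaviour can be pinned down with a small reference implementation in plain Python. This is only a sketch of the semantics on an ordinary list; `find_indices` is the name proposed in the issue and does not exist in pyarrow, and a real kernel would operate on Arrow arrays.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def find_indices(values: Sequence[T], target: T) -> List[int]:
    """Return every position at which `values` holds `target`."""
    return [i for i, v in enumerate(values) if v == target]


values = [1, 2, 2, 3, 4, 1]
indices = find_indices(values, 1)
print(indices)  # [0, 5]
```

A value that never occurs yields an empty list, whereas a "first index" lookup has to signal absence in some other way.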
[jira] [Created] (ARROW-14947) [C++] Implement maths with timestamps?
Nicola Crane created ARROW-14947:

Summary: [C++] Implement maths with timestamps?
Key: ARROW-14947
URL: https://issues.apache.org/jira/browse/ARROW-14947
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Nicola Crane

Please could we have maths with timestamps implemented? In order to implement some of the functionality I'd like in R, I need to be able to do maths with dates. For example:

* Addition and subtraction: Timestamp + Duration = Timestamp (with and without rollback, so we have the ability to do e.g. 2021-03-30 minus 1 month and get back either a null or 2021-02-28), plus the ability to specify whether to roll back to the first or last day, and whether to preserve or reset the time. See https://lubridate.tidyverse.org/reference/mplus.html for documentation of the R functionality.
* Multiplying Durations: Duration * Numeric = Duration
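The rollback behaviour asked for above can be sketched with the Python standard library. This is an illustration of the arithmetic only, not a proposed Arrow kernel. `add_months` is a hypothetical helper; it covers just the "roll back to the last day" and "return null" cases and always preserves the time of day, so the other options mentioned in the request are not modelled.

```python
import calendar
from datetime import datetime, timedelta
from typing import Optional


def add_months(ts: datetime, months: int, rollback: bool = True) -> Optional[datetime]:
    """Add `months` calendar months to `ts` (hypothetical helper).

    If the resulting day does not exist (e.g. 30 February), either roll back
    to the last day of the target month or return None.
    """
    # Work with a zero-based month count so negative offsets wrap correctly.
    total = ts.year * 12 + (ts.month - 1) + months
    year, month = divmod(total, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    if ts.day > last_day:
        if not rollback:
            return None
        return ts.replace(year=year, month=month, day=last_day)
    return ts.replace(year=year, month=month)


rolled = add_months(datetime(2021, 3, 30), -1)                 # 2021-02-28
missing = add_months(datetime(2021, 3, 30), -1, rollback=False)  # None

# Duration * Numeric = Duration, using timedelta as the duration type.
scaled = timedelta(hours=1) * 3
```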
[jira] [Created] (ARROW-14948) [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp
Nicola Crane created ARROW-14948:

Summary: [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp
Key: ARROW-14948
URL: https://issues.apache.org/jira/browse/ARROW-14948
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14949) [R] Implement lubridate interval functions
Nicola Crane created ARROW-14949:

Summary: [R] Implement lubridate interval functions
Key: ARROW-14949
URL: https://issues.apache.org/jira/browse/ARROW-14949
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14950) [R] Implement lubridate's interval type and lubridate::interval()
Nicola Crane created ARROW-14950:

Summary: [R] Implement lubridate's interval type and lubridate::interval()
Key: ARROW-14950
URL: https://issues.apache.org/jira/browse/ARROW-14950
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane

In lubridate, an interval is a pair of dates. We could implement this as an extension type.
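To make "a pair of dates" concrete, here is a small sketch of such a type in plain Python. The class and method names are hypothetical, and the semantics (inclusive endpoints, reversed intervals allowed) are assumptions chosen for the example rather than a statement of lubridate's exact behaviour or of how an Arrow extension type would be laid out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Interval:
    """A pair of dates; `start` may come after `end` for a reversed interval."""

    start: date
    end: date

    def length(self) -> timedelta:
        # Signed length: negative when the interval is reversed.
        return self.end - self.start

    def flip(self) -> "Interval":
        # Swap the endpoints, reversing the direction.
        return Interval(self.end, self.start)

    def standardize(self) -> "Interval":
        # Return an equivalent interval with start <= end.
        return self if self.start <= self.end else self.flip()

    def contains(self, day: date) -> bool:
        # Endpoints are treated as inclusive in this sketch.
        s = self.standardize()
        return s.start <= day <= s.end

    def overlaps(self, other: "Interval") -> bool:
        a, b = self.standardize(), other.standardize()
        return a.start <= b.end and b.start <= a.end


q1 = Interval(date(2021, 1, 1), date(2021, 3, 31))
spring = Interval(date(2021, 3, 1), date(2021, 5, 31))
```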
[jira] [Created] (ARROW-14951) [R] Implement lubridate %within%
Nicola Crane created ARROW-14951:

Summary: [R] Implement lubridate %within%
Key: ARROW-14951
URL: https://issues.apache.org/jira/browse/ARROW-14951
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14952) [R] Implement lubridate int_start, int_end, int_length
Nicola Crane created ARROW-14952:

Summary: [R] Implement lubridate int_start, int_end, int_length
Key: ARROW-14952
URL: https://issues.apache.org/jira/browse/ARROW-14952
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14953) [R] Implement lubridate int_flip and int_shift
Nicola Crane created ARROW-14953:

Summary: [R] Implement lubridate int_flip and int_shift
Key: ARROW-14953
URL: https://issues.apache.org/jira/browse/ARROW-14953
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14954) [R] Implement lubridate's int_aligns and int_overlaps
Nicola Crane created ARROW-14954:

Summary: [R] Implement lubridate's int_aligns and int_overlaps
Key: ARROW-14954
URL: https://issues.apache.org/jira/browse/ARROW-14954
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14955) [R] Implement lubridate's int_diff
Nicola Crane created ARROW-14955:

Summary: [R] Implement lubridate's int_diff
Key: ARROW-14955
URL: https://issues.apache.org/jira/browse/ARROW-14955
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14956) [R] Implement lubridate's int_standardize
Nicola Crane created ARROW-14956:

Summary: [R] Implement lubridate's int_standardize
Key: ARROW-14956
URL: https://issues.apache.org/jira/browse/ARROW-14956
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-14957) [C++] Update OpenTelemetry to v1.1.0
David Li created ARROW-14957:

Summary: [C++] Update OpenTelemetry to v1.1.0
Key: ARROW-14957
URL: https://issues.apache.org/jira/browse/ARROW-14957
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: David Li
Assignee: David Li

Changelog is fairly minor: [https://github.com/open-telemetry/opentelemetry-cpp/releases/tag/v1.1.0]
[jira] [Created] (ARROW-14958) [C++] Enable OpenTelemetry with Arrow Flight
David Li created ARROW-14958:

Summary: [C++] Enable OpenTelemetry with Arrow Flight
Key: ARROW-14958
URL: https://issues.apache.org/jira/browse/ARROW-14958
Project: Apache Arrow
Issue Type: Improvement
Reporter: David Li

Sans Python support, at least for now, since figuring out how to do the bindings will be a challenge there.

Also see [https://github.com/open-telemetry/community/discussions/734]
[jira] [Created] (ARROW-14959) Reading Hive-style partitioned parquet files from GCS
Garrett Weaver created ARROW-14959:

Summary: Reading Hive-style partitioned parquet files from GCS
Key: ARROW-14959
URL: https://issues.apache.org/jira/browse/ARROW-14959
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 6.0.1
Reporter: Garrett Weaver

Trying to read a spark-generated hive-style partitioned parquet dataset with `gcsfs` and `pyarrow`, but getting a `FileNotFoundError` if I try to read from the base directory, or even if I try to read directly from one of the partitions. Not sure if I am doing something wrong or if it is not supported.

Note that I have successfully read this hive-style partitioned parquet dataset using other methods to rule out any other issues, including:

* Successful read with pyspark by using spark.read.parquet
* Successful read of a specific partition by passing a list of paths to `ParquetDataset`
* Also tested reading another spark-generated parquet dataset with no Hive-style partitions from GCS, and that worked as well.

Below is what I am trying:

```
import gcsfs
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

gcs = gcsfs.GCSFileSystem()

partitions = pa.dataset.partitioning(
    pa.schema([("partition_var", pa.string())]),
    dictionaries="infer",
    flavor="hive",
)

# Read from the base directory of the partitioned dataset
partitioned_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir",
    filesystem=gcs,
    format="parquet",
    partitioning=partitions,
)

# Read directly from a single partition
partition_of_dataset = pa.dataset.dataset(
    "path/to/partitioned/dataset/base/dir/partition_var=some_value",
    filesystem=gcs,
    format="parquet",
)
```

The errors returned for both are below:

```
FileNotFoundError: path/to/partitioned/dataset/base/dir/
FileNotFoundError: path/to/partitioned/dataset/base/dir/partition_var=some_value/
```
[jira] [Created] (ARROW-14960) [C++] Google style guide allows mutable references now, what do?
Ben Kietzman created ARROW-14960:

Summary: [C++] Google style guide allows mutable references now, what do?
Key: ARROW-14960
URL: https://issues.apache.org/jira/browse/ARROW-14960
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Ben Kietzman

As of https://github.com/google/styleguide/commit/7a7a2f510efe7d7fc5ea8fbed549ddb31fac8f3e the Google Style Guide no longer forbids the use of mutable references for output arguments, and actually encourages using them when the output argument is not optional. This puts Arrow C++ style out of sync, since we've continued to police toward usage of pointers for output arguments. We could:

- keep the ban and note this as a deviation from Google style in [development.rst|https://github.com/bkietz/arrow/blob/392af8aa999f940ab8fd61684820b2c6d89f7871/docs/source/developers/cpp/development.rst#L74-L75]
- open JIRA(s) for updating pointer-output APIs where applicable
[jira] [Created] (ARROW-14961) Bump version on Google Benchmark
Sasha Krassovsky created ARROW-14961:

Summary: Bump version on Google Benchmark
Key: ARROW-14961
URL: https://issues.apache.org/jira/browse/ARROW-14961
Project: Apache Arrow
Issue Type: Bug
Components: Benchmarking
Reporter: Sasha Krassovsky
Assignee: Sasha Krassovsky

Google Benchmark v1.6.0 came out - I'd like to use a couple of functions it provides in a different issue.
[jira] [Created] (ARROW-14962) [CI] Fix minio installation on s390x
Krisztian Szucs created ARROW-14962:

Summary: [CI] Fix minio installation on s390x
Key: ARROW-14962
URL: https://issues.apache.org/jira/browse/ARROW-14962
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration
Reporter: Krisztian Szucs
Fix For: 7.0.0