[jira] [Resolved] (ARROW-14261) [C++] Includes should be in alphabetical order
[ https://issues.apache.org/jira/browse/ARROW-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-14261. -- Fix Version/s: 6.0.0 Resolution: Fixed Issue resolved by pull request 11362 [https://github.com/apache/arrow/pull/11362] > [C++] Includes should be in alphabetical order > -- > > Key: ARROW-14261 > URL: https://issues.apache.org/jira/browse/ARROW-14261 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benson Muite >Assignee: Benson Muite >Priority: Trivial > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Includes in > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/registry_internal.h] > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/registry.cc > should be in alphabetical order -- This message was sent by Atlassian Jira (v8.3.4#803005)
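The fix tracked above is mechanical, so it can be checked automatically. The sketch below is a hypothetical helper (not part of the Arrow codebase or its lint tooling) showing how one might detect contiguous `#include` blocks that are out of alphabetical order:

```python
# Hypothetical checker: finds contiguous #include blocks in C++ source
# that are not alphabetized. Illustrative only; not Arrow's actual linter.
import re

def unsorted_include_blocks(source: str) -> list[list[str]]:
    """Return each contiguous #include block that is not in sorted order."""
    blocks, current = [], []
    for line in source.splitlines():
        if re.match(r'\s*#include\b', line):
            current.append(line.strip())
        else:
            if current:
                blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return [b for b in blocks if b != sorted(b)]

header = """\
#include "arrow/compute/registry.h"
#include "arrow/compute/registry_internal.h"

#include <string>
#include <memory>
"""
print(unsorted_include_blocks(header))  # -> [['#include <string>', '#include <memory>']]
```

Blocks separated by blank lines are checked independently, matching the usual convention of sorting within each include group rather than across groups.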
[jira] [Resolved] (ARROW-14222) [C++] Create GcsFileSystem skeleton
[ https://issues.apache.org/jira/browse/ARROW-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-14222. -- Fix Version/s: 6.0.0 Resolution: Fixed Issue resolved by pull request 11331 [https://github.com/apache/arrow/pull/11331] > [C++] Create GcsFileSystem skeleton > --- > > Key: ARROW-14222 > URL: https://issues.apache.org/jira/browse/ARROW-14222 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Carlos O'Ryan >Assignee: Carlos O'Ryan >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Implement a skeleton for GCSFileSystem. All functions would return > `Status::NotImplemented()`. This will keep the future changes smaller, and > allow me to verify all CI builds are working in a smaller PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
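The "skeleton first" pattern described in the issue can be sketched generically. The actual GcsFileSystem is a C++ class whose methods return `Status::NotImplemented()`; the Python analogue below (hypothetical class and method names) shows the same idea of landing the interface before any logic:

```python
# Python analogue of the skeleton pattern: every operation is stubbed to
# fail with a clear not-implemented error, so the interface and build
# wiring can be reviewed in a small PR before real logic is added.
# Illustrative only; names here are assumptions, not Arrow's Python API.

class GcsFileSystemSkeleton:
    def _not_implemented(self, op: str):
        raise NotImplementedError(f"GcsFileSystem::{op} is not implemented yet")

    def get_file_info(self, path: str):
        self._not_implemented("GetFileInfo")

    def create_dir(self, path: str, recursive: bool = True):
        self._not_implemented("CreateDir")

    def open_input_stream(self, path: str):
        self._not_implemented("OpenInputStream")

fs = GcsFileSystemSkeleton()
try:
    fs.get_file_info("gs://bucket/object")
except NotImplementedError as e:
    print(e)  # GcsFileSystem::GetFileInfo is not implemented yet
```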
[jira] [Updated] (ARROW-14267) [Python] Cannot convert pd.DataFrame with geometry cells to pa.Table
[ https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-14267: - Summary: [Python] Cannot convert pd.DataFrame with geometry cells to pa.Table (was: Cannot convert pd.DataFrame with geometry cells to pa.Table) > [Python] Cannot convert pd.DataFrame with geometry cells to pa.Table > > > Key: ARROW-14267 > URL: https://issues.apache.org/jira/browse/ARROW-14267 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 >Reporter: Henrikh Kantuni >Priority: Minor > Labels: pyarrow > > Example: > {code:java} > import geopandas as gpd > import pandas as pd > import pyarrow as pa > path = gpd.datasets.get_path("naturalearth_lowres") > data = gpd.read_file(path) > df = pd.DataFrame(data) > table = pa.Table.from_pandas(df) > print(table) > {code} > Throws the following error: > {code:java} > Traceback (most recent call last): > File "/Users/Henrikh/Desktop/tmp.py", line 8, in > table = pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in dataframe_to_arrays > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 581, in convert_column > raise e > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 575, in convert_column > result = pa.array(col, type=type_, from_pandas=True, safe=safe) > File "pyarrow/array.pxi", line 302, in pyarrow.lib.array > File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array > File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion > 
failed for column geometry with type geometry'){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14267) Cannot convert pd.DataFrame with geometry cells to pa.Table
[ https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrikh Kantuni updated ARROW-14267: Summary: Cannot convert pd.DataFrame with geometry cells to pa.Table (was: Cannot convert DataFrame with geometry cells to Table) > Cannot convert pd.DataFrame with geometry cells to pa.Table > --- > > Key: ARROW-14267 > URL: https://issues.apache.org/jira/browse/ARROW-14267 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 >Reporter: Henrikh Kantuni >Priority: Minor > Labels: pyarrow > > Example: > {code:java} > import geopandas as gpd > import pandas as pd > import pyarrow as pa > path = gpd.datasets.get_path("naturalearth_lowres") > data = gpd.read_file(path) > df = pd.DataFrame(data) > table = pa.Table.from_pandas(df) > print(table) > {code} > Throws the following error: > {code:java} > Traceback (most recent call last): > File "/Users/Henrikh/Desktop/tmp.py", line 8, in > table = pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in dataframe_to_arrays > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 581, in convert_column > raise e > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 575, in convert_column > result = pa.array(col, type=type_, from_pandas=True, safe=safe) > File "pyarrow/array.pxi", line 302, in pyarrow.lib.array > File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array > File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion > failed for column 
geometry with type geometry'){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14268) Cannot convert DataFrame with complex128 cells to Table
Henrikh Kantuni created ARROW-14268: --- Summary: Cannot convert DataFrame with complex128 cells to Table Key: ARROW-14268 URL: https://issues.apache.org/jira/browse/ARROW-14268 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 5.0.0 Reporter: Henrikh Kantuni Example: {code:java} import numpy as np import pandas as pd import pyarrow as pa data = np.array([1 + 1j]) df = pd.DataFrame(data) table = pa.Table.from_pandas(df) print(table) {code} Throws the following error: {code:java} Traceback (most recent call last): File "/Users/Henrikh/Desktop/tmp.py", line 7, in table = pa.Table.from_pandas(df) File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column raise e File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column result = pa.array(col, type=type_, from_pandas=True, safe=safe) File "pyarrow/array.pxi", line 302, in pyarrow.lib.array File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status pyarrow.lib.ArrowNotImplementedError: ('Unsupported numpy type 15', 'Conversion failed for column 0 with type complex128'){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
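Arrow has no complex number type, which is why the conversion above fails. A common workaround (shown here as a plain-Python sketch so it is self-contained; with pandas you would build two Series the same way) is to split a complex128 column into two float64 columns before calling `pa.Table.from_pandas`:

```python
# Workaround sketch: split a complex column into real and imaginary
# float columns, both of which Arrow can convert.
data = [1 + 1j, 2 - 3j]

real_part = [z.real for z in data]
imag_part = [z.imag for z in data]

print(real_part)  # [1.0, 2.0]
print(imag_part)  # [1.0, -3.0]
```

The two float columns round-trip through Arrow and can be recombined on read.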
[jira] [Updated] (ARROW-14268) Cannot convert pd.DataFrame with complex128 cells to pa.Table
[ https://issues.apache.org/jira/browse/ARROW-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrikh Kantuni updated ARROW-14268: Summary: Cannot convert pd.DataFrame with complex128 cells to pa.Table (was: Cannot convert DataFrame with complex128 cells to Table) > Cannot convert pd.DataFrame with complex128 cells to pa.Table > - > > Key: ARROW-14268 > URL: https://issues.apache.org/jira/browse/ARROW-14268 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 >Reporter: Henrikh Kantuni >Priority: Minor > Labels: pyarrow > > Example: > > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > data = np.array([1 + 1j]) > df = pd.DataFrame(data) > table = pa.Table.from_pandas(df) > print(table) > {code} > Throws the following error: > {code:java} > Traceback (most recent call last): > File "/Users/Henrikh/Desktop/tmp.py", line 7, in > table = pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in dataframe_to_arrays > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 581, in convert_column > raise e > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 575, in convert_column > result = pa.array(col, type=type_, from_pandas=True, safe=safe) > File "pyarrow/array.pxi", line 302, in pyarrow.lib.array > File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array > File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: ('Unsupported numpy type 15', > 'Conversion failed for column 0 with type complex128'){code} > -- This message 
was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14267) Cannot convert DataFrame with geometry cells to Table
[ https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrikh Kantuni updated ARROW-14267: Summary: Cannot convert DataFrame with geometry cells to Table (was: Cannot convert DataFrame with geometry `numpy.dtype` cells to Table) > Cannot convert DataFrame with geometry cells to Table > - > > Key: ARROW-14267 > URL: https://issues.apache.org/jira/browse/ARROW-14267 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 >Reporter: Henrikh Kantuni >Priority: Minor > Labels: pyarrow > > Example: > {code:java} > import geopandas as gpd > import pandas as pd > import pyarrow as pa > path = gpd.datasets.get_path("naturalearth_lowres") > data = gpd.read_file(path) > df = pd.DataFrame(data) > table = pa.Table.from_pandas(df) > print(table) > {code} > Throws the following error: > {code:java} > Traceback (most recent call last): > File "/Users/Henrikh/Desktop/tmp.py", line 8, in > table = pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in dataframe_to_arrays > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 581, in convert_column > raise e > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 575, in convert_column > result = pa.array(col, type=type_, from_pandas=True, safe=safe) > File "pyarrow/array.pxi", line 302, in pyarrow.lib.array > File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array > File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion > failed for column 
geometry with type geometry'){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14267) Cannot convert DataFrame with geometry `numpy.dtype` cells to Table
[ https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrikh Kantuni updated ARROW-14267: Description: Example: {code:java} import geopandas as gpd import pandas as pd import pyarrow as pa path = gpd.datasets.get_path("naturalearth_lowres") data = gpd.read_file(path) df = pd.DataFrame(data) table = pa.Table.from_pandas(df) print(table) {code} Throws the following error: {code:java} Traceback (most recent call last): File "/Users/Henrikh/Desktop/tmp.py", line 8, in table = pa.Table.from_pandas(df) File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column raise e File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column result = pa.array(col, type=type_, from_pandas=True, safe=safe) File "pyarrow/array.pxi", line 302, in pyarrow.lib.array File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column geometry with type geometry'){code} was: Example: {code:java} import geopandas as gpd import pandas as pd import pyarrow as pa path = gpd.datasets.get_path("naturalearth_lowres") data = gpd.read_file(path) df = pd.DataFrame(data) table = pa.Table.from_pandas(df) print(table) {code} Throws the following error: {code:java} Traceback (most recent call last): File "/Users/Henrikh/Desktop/tmp.py", line 8, in table = pa.Table.from_pandas(df) File "pyarrow/table.pxi", line 1553, in 
pyarrow.lib.Table.from_pandas File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column raise e File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column result = pa.array(col, type=type_, from_pandas=True, safe=safe) File "pyarrow/array.pxi", line 302, in pyarrow.lib.array File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column geometry with type geometry'){code} > Cannot convert DataFrame with geometry `numpy.dtype` cells to Table > --- > > Key: ARROW-14267 > URL: https://issues.apache.org/jira/browse/ARROW-14267 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 >Reporter: Henrikh Kantuni >Priority: Minor > Labels: pyarrow > > Example: > {code:java} > import geopandas as gpd > import pandas as pd > import pyarrow as pa > path = gpd.datasets.get_path("naturalearth_lowres") > data = gpd.read_file(path) > df = pd.DataFrame(data) > table = pa.Table.from_pandas(df) > print(table) > {code} > Throws the following error: > {code:java} > Traceback (most recent call last): > File "/Users/Henrikh/Desktop/tmp.py", line 8, in > table = pa.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in dataframe_to_arrays > arrays = [convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 594, in > arrays = 
[convert_column(c, f) > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 581, in convert_column > raise e > File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line > 575, in convert_column > result = pa.array(col, type=type_, from_pandas=True, safe=safe) > File "pyarrow/array.pxi", line 302, in pyarrow.lib.array > File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array > File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion > failed for column geometry with type geometry'){code} > --
[jira] [Created] (ARROW-14267) Cannot convert DataFrame with geometry `numpy.dtype` cells to Table
Henrikh Kantuni created ARROW-14267: --- Summary: Cannot convert DataFrame with geometry `numpy.dtype` cells to Table Key: ARROW-14267 URL: https://issues.apache.org/jira/browse/ARROW-14267 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 5.0.0 Reporter: Henrikh Kantuni Example: {code:java} import geopandas as gpd import pandas as pd import pyarrow as pa path = gpd.datasets.get_path("naturalearth_lowres") data = gpd.read_file(path) df = pd.DataFrame(data) table = pa.Table.from_pandas(df) print(table) {code} Throws the following error: {code:java} Traceback (most recent call last): File "/Users/Henrikh/Desktop/tmp.py", line 8, in table = pa.Table.from_pandas(df) File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in arrays = [convert_column(c, f) File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column raise e File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column result = pa.array(col, type=type_, from_pandas=True, safe=safe) File "pyarrow/array.pxi", line 302, in pyarrow.lib.array File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column geometry with type geometry'){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
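The failure above happens because the geometry column holds arbitrary Python objects, which Arrow cannot map to a dtype. A usual workaround is to serialize each geometry to WKT text (or WKB bytes) first, since Arrow handles strings and binary fine. The sketch below uses a stand-in class so it runs without geopandas/shapely; real shapely geometries expose a `.wkt` property the same way:

```python
# Workaround sketch: serialize geometry objects to WKT strings before
# handing the DataFrame to Arrow. FakeGeometry is a stand-in so this
# sketch is self-contained; shapely geometries also have a .wkt property.

class FakeGeometry:
    def __init__(self, wkt: str):
        self.wkt = wkt

geometry_column = [FakeGeometry("POINT (30 10)"), FakeGeometry("POINT (40 20)")]

# In the real case: df["geometry"] = df["geometry"].apply(lambda g: g.wkt)
wkt_column = [g.wkt for g in geometry_column]
print(wkt_column)  # ['POINT (30 10)', 'POINT (40 20)']
```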
[jira] [Created] (ARROW-14266) [R] Use WriteNode to write queries
Neal Richardson created ARROW-14266: --- Summary: [R] Use WriteNode to write queries Key: ARROW-14266 URL: https://issues.apache.org/jira/browse/ARROW-14266 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Fix For: 7.0.0 Following ARROW-13542. Any query that has a join or an aggregation currently has to first evaluate the query and hold it in memory before creating a Scanner to write it. We could improve that by using a WriteNode inside write_dataset() (and maybe that improves the other cases too, or at least allows us to delete some code). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14265) [C++][Gandiva] Add support for LLVM 13
[ https://issues.apache.org/jira/browse/ARROW-14265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14265: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Add support for LLVM 13 > -- > > Key: ARROW-14265 > URL: https://issues.apache.org/jira/browse/ARROW-14265 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14265) [C++][Gandiva] Add support for LLVM 13
[ https://issues.apache.org/jira/browse/ARROW-14265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-14265: - Summary: [C++][Gandiva] Add support for LLVM 13 (was: [C++][Gandiva] Support building with LLVM 13) > [C++][Gandiva] Add support for LLVM 13 > -- > > Key: ARROW-14265 > URL: https://issues.apache.org/jira/browse/ARROW-14265 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Gandiva >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14265) [C++][Gandiva] Support building with LLVM 13
Kouhei Sutou created ARROW-14265: Summary: [C++][Gandiva] Support building with LLVM 13 Key: ARROW-14265 URL: https://issues.apache.org/jira/browse/ARROW-14265 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14264) [R] Support inequality joins
Neal Richardson created ARROW-14264: --- Summary: [R] Support inequality joins Key: ARROW-14264 URL: https://issues.apache.org/jira/browse/ARROW-14264 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Fix For: 7.0.0 We'll need this not-yet-merged dplyr API to do it: https://github.com/tidyverse/dplyr/pull/5910 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14264) [R] Support inequality joins
[ https://issues.apache.org/jira/browse/ARROW-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-14264: Labels: query-engine (was: ) > [R] Support inequality joins > > > Key: ARROW-14264 > URL: https://issues.apache.org/jira/browse/ARROW-14264 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > Labels: query-engine > Fix For: 7.0.0 > > > We'll need this not-yet-merged dplyr API to do it: > https://github.com/tidyverse/dplyr/pull/5910 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's
[ https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Ashby updated ARROW-14263:

Environment:
pyarrow==5.0.0
C++ = 5.0.0
Windows 10 Pro x64
Python 3.8.5

was:
pyarrow==5.0.0
C++ = 5.0.0
Windows 10 Pro x64

> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
> Python 3.8.5
> Reporter: Bryan Ashby
> Priority: Major
> Attachments: record-batch-large.arrow
>
> I'm running into various exceptions (often: "Invalid flatbuffers message")
> when attempting to de-serialize RecordBatches in Python that were generated
> in C++. The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code):
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   (*writer)->Close();
> }
> auto buffer = (*stream)->Finish();
> std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>     data = f.read()
> reader = pa.RecordBatchStreamReader(data)  # throws here - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's
[ https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Ashby updated ARROW-14263: Description: I'm running into various exceptions (often: "Invalid flatbuffers message") when attempting to de-serialize RecordBatch's in Python that were generated in C++. The same batch can be de-serialized back within C++. *Example (C++)* (status checks omitted, but they are check in real code)*:* {code:java} const auto stream = arrow::io::BufferOutputStream::Create(); { const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema); sdk::MaybeThrowError(writer); const auto writeRes = (*writer)->WriteRecordBatch(batch); sdk::MaybeThrowError((*writer)->Close()); } auto buffer = (*stream)->Finish();std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python ofs.write(reinterpret_cast((*buffer)->data()), (*buffer)->size()); ofs.close(); auto backAgain = DeserializeRecordBatch((*buffer)); // all good {code} *Then in Python*: {code:java} with open("record-batch-large.arrow", "rb") as f: data = f.read() reader = pa.RecordBatchStreamReader(data) // throws here - "Invalid flatbuffers message" {code} Please see the attached .arrow file (produced above). Any ideas? was: I'm running into various exceptions (often: "Invalid flatbuffers message") when attempting to de-serialize RecordBatch's in Python that were generated in C++. The same batch can be de-serialized back within C++. 
*Example (C++)* (status checks omitted, but they are check in real code)*:* {code:java} const auto stream = arrow::io::BufferOutputStream::Create(); { const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema); sdk::MaybeThrowError(writer); const auto writeRes = (*writer)->WriteRecordBatch(batch); sdk::MaybeThrowError((*writer)->Close()); } auto buffer = (*stream)->Finish();std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python ofs.write(reinterpret_cast((*buffer)->data()), (*buffer)->size()); ofs.close(); auto backAgain = DeserializeRecordBatch((*buffer)); // all good {code} *Then in Python*: {code:java} with open("record-batch-large.arrow", "rb") as f: data = f.read()reader = pa.RecordBatchStreamReader(data) // throws here - "Invalid flatbuffers message" {code} Please see the attached .arrow file (produced above). Any ideas? > "Invalid flatbuffers message" thrown with some serialized RecordBatch's > --- > > Key: ARROW-14263 > URL: https://issues.apache.org/jira/browse/ARROW-14263 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 5.0.0 > Environment: pyarrow==5.0.0 > C++ = 5.0.0 > Windows 10 Pro x64 >Reporter: Bryan Ashby >Priority: Major > Attachments: record-batch-large.arrow > > > I'm running into various exceptions (often: "Invalid flatbuffers message") > when attempting to de-serialize RecordBatch's in Python that were generated > in C++. > The same batch can be de-serialized back within C++. 
> *Example (C++)* (status checks omitted, but they are checked in real code):
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   sdk::MaybeThrowError(writer);
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());
> }
> auto buffer = (*stream)->Finish();
> std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>     data = f.read()
> reader = pa.RecordBatchStreamReader(data)  # throws here - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's
[ https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Ashby updated ARROW-14263: Description: I'm running into various exceptions (often: "Invalid flatbuffers message") when attempting to de-serialize RecordBatch's in Python that were generated in C++. The same batch can be de-serialized back within C++. *Example (C++)* (status checks omitted, but they are check in real code)*:* {code:java} const auto stream = arrow::io::BufferOutputStream::Create(); { const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema); const auto writeRes = (*writer)->WriteRecordBatch(batch); (*writer)->Close(); } auto buffer = (*stream)->Finish(); std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python ofs.write(reinterpret_cast((*buffer)->data()), (*buffer)->size()); ofs.close(); auto backAgain = DeserializeRecordBatch((*buffer)); // all good {code} *Then in Python*: {code:java} with open("record-batch-large.arrow", "rb") as f: data = f.read() reader = pa.RecordBatchStreamReader(data) // throws here - "Invalid flatbuffers message" {code} Please see the attached .arrow file (produced above). Any ideas? was: I'm running into various exceptions (often: "Invalid flatbuffers message") when attempting to de-serialize RecordBatch's in Python that were generated in C++. The same batch can be de-serialized back within C++. 
*Example (C++)* (status checks omitted, but they are check in real code)*:* {code:java} const auto stream = arrow::io::BufferOutputStream::Create(); { const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema); const auto writeRes = (*writer)->WriteRecordBatch(batch); sdk::MaybeThrowError((*writer)->Close()); } auto buffer = (*stream)->Finish(); std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python ofs.write(reinterpret_cast((*buffer)->data()), (*buffer)->size()); ofs.close(); auto backAgain = DeserializeRecordBatch((*buffer)); // all good {code} *Then in Python*: {code:java} with open("record-batch-large.arrow", "rb") as f: data = f.read() reader = pa.RecordBatchStreamReader(data) // throws here - "Invalid flatbuffers message" {code} Please see the attached .arrow file (produced above). Any ideas? > "Invalid flatbuffers message" thrown with some serialized RecordBatch's > --- > > Key: ARROW-14263 > URL: https://issues.apache.org/jira/browse/ARROW-14263 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 5.0.0 > Environment: pyarrow==5.0.0 > C++ = 5.0.0 > Windows 10 Pro x64 >Reporter: Bryan Ashby >Priority: Major > Attachments: record-batch-large.arrow > > > I'm running into various exceptions (often: "Invalid flatbuffers message") > when attempting to de-serialize RecordBatch's in Python that were generated > in C++. > The same batch can be de-serialized back within C++. 
> *Example (C++)* (status checks omitted, but they are checked in real code):
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   (*writer)->Close();
> }
> auto buffer = (*stream)->Finish();
> std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>     data = f.read()
> reader = pa.RecordBatchStreamReader(data)  # throws here - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas? -- This message was sent by Atlassian Jira (v8.3.4#803005)
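One detail worth checking in the reporter's C++ snippet: the `std::ofstream` is opened without `std::ios::binary`. On Windows (the reported environment), a text-mode stream rewrites `\n` bytes as `\r\n`, which silently corrupts a binary payload such as an Arrow IPC stream and could plausibly produce exactly this kind of deserialization error. The Python sketch below shows the binary-safe pattern; the payload bytes are illustrative, not a real Arrow stream:

```python
# Binary-safe file round-trip: always use binary mode ("wb"/"rb" in
# Python, std::ios::binary in C++) for Arrow IPC data. The payload
# deliberately contains 0x0A bytes, the case that text mode on Windows
# would rewrite as 0x0D 0x0A.
import os
import tempfile

payload = b"\x0aARROW\x0a\x00\x00"  # illustrative bytes, not a real stream

path = os.path.join(tempfile.mkdtemp(), "record-batch.bin")
with open(path, "wb") as f:  # "wb": binary mode, no newline translation
    f.write(payload)
with open(path, "rb") as f:
    assert f.read() == payload  # round-trips byte-for-byte
print("binary round-trip OK")
```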
[jira] [Created] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's
Bryan Ashby created ARROW-14263: --- Summary: "Invalid flatbuffers message" thrown with some serialized RecordBatch's Key: ARROW-14263 URL: https://issues.apache.org/jira/browse/ARROW-14263 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 5.0.0 Environment: pyarrow==5.0.0 C++ = 5.0.0 Windows 10 Pro x64 Reporter: Bryan Ashby Attachments: record-batch-large.arrow

I'm running into various exceptions (often: "Invalid flatbuffers message") when attempting to de-serialize RecordBatch's in Python that were generated in C++. The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
    const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
    sdk::MaybeThrowError(writer);
    const auto writeRes = (*writer)->WriteRecordBatch(batch);
    sdk::MaybeThrowError((*writer)->Close());
    sdk::MaybeThrowError(writeRes);
}
auto buffer = (*stream)->Finish();
std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();
auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
    data = f.read()
reader = pa.RecordBatchStreamReader(data)  # throws here - "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above). Any ideas?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-7878) [C++] Implement LogicalPlan and LogicalPlanBuilder
[ https://issues.apache.org/jira/browse/ARROW-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman closed ARROW-7878. --- Resolution: Won't Fix > [C++] Implement LogicalPlan and LogicalPlanBuilder > -- > > Key: ARROW-7878 > URL: https://issues.apache.org/jira/browse/ARROW-7878 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.17.0 >Reporter: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Time Spent: 18h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13739) [R] Support dplyr::count() and tally()
[ https://issues.apache.org/jira/browse/ARROW-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13739. - Resolution: Fixed Issue resolved by pull request 11306 [https://github.com/apache/arrow/pull/11306] > [R] Support dplyr::count() and tally() > -- > > Key: ARROW-13739 > URL: https://issues.apache.org/jira/browse/ARROW-13739 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Nicola Crane >Priority: Critical > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > These may just work by borrowing the data.frame methods in dplyr -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-4630) [C++] Implement serial version of join
[ https://issues.apache.org/jira/browse/ARROW-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman closed ARROW-4630. --- Resolution: Won't Fix > [C++] Implement serial version of join > -- > > Key: ARROW-4630 > URL: https://issues.apache.org/jira/browse/ARROW-4630 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.12.0 >Reporter: Areg Melik-Adamyan >Assignee: Areg Melik-Adamyan >Priority: Major > > Implement the serial version of join operator. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8765) [C++] Design Scheduler API
[ https://issues.apache.org/jira/browse/ARROW-8765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426346#comment-17426346 ] Neal Richardson commented on ARROW-8765: [~apitrou] this seems handled by the ExecPlan and related work, can we close this or is there more to do? > [C++] Design Scheduler API > -- > > Key: ARROW-8765 > URL: https://issues.apache.org/jira/browse/ARROW-8765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13078) [R] Bindings for str_replace_na()
[ https://issues.apache.org/jira/browse/ARROW-13078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-13078: --- Assignee: (was: Ian Cook) > [R] Bindings for str_replace_na() > - > > Key: ARROW-13078 > URL: https://issues.apache.org/jira/browse/ARROW-13078 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Labels: good-second-issue > Fix For: 6.0.0 > > > Implement the stringr function {{str_replace_na()}} which is useful in > combination with {{str_c()}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features
[ https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-12137: Fix Version/s: (was: 6.0.0) 7.0.0 > [R] New/improved vignette on dplyr features > --- > > Key: ARROW-12137 > URL: https://issues.apache.org/jira/browse/ARROW-12137 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 7.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12137) [R] New/improved vignette on dplyr features
[ https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-12137: --- Assignee: (was: Ian Cook) > [R] New/improved vignette on dplyr features > --- > > Key: ARROW-12137 > URL: https://issues.apache.org/jira/browse/ARROW-12137 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 6.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12763) [R] Optimize dplyr queries that use head/tail after arrange
[ https://issues.apache.org/jira/browse/ARROW-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-12763: --- Assignee: Neal Richardson > [R] Optimize dplyr queries that use head/tail after arrange > --- > > Key: ARROW-12763 > URL: https://issues.apache.org/jira/browse/ARROW-12763 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Assignee: Neal Richardson >Priority: Major > Labels: query-engine > Fix For: 6.0.0 > > > Use the Arrow C++ function {{partition_nth_indices}} to optimize dplyr > queries like this: > {code:r} > iris %>% > Table$create() %>% > arrange(desc(Sepal.Length)) %>% > head(10) %>% > collect() > {code} > This query sorts the full table even though it doesn't need to. It could use > {{partition_nth_indices}} to find the rows containing the top 10 values of > {{Sepal.Length}} and only collect and sort those 10 rows. > Test to see if this improves performance in practice on larger data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write
[ https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426339#comment-17426339 ] Joris Van den Bossche commented on ARROW-11057: --- This logic was changed in PARQUET-1798 / https://github.com/apache/arrow/pull/10289, and now those PARQUET:field_id fields are only preserved if already present, and not automatically generated. If you re-run the example above with a recently released pyarrow, you actually get identical files now, and the schemas also don't contain the field_ids anymore. > [Python] Data inconsistency with read and write > --- > > Key: ARROW-11057 > URL: https://issues.apache.org/jira/browse/ARROW-11057 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 >Reporter: David Quijano >Priority: Major > > I have been reading and writing some tables to parquet and I found some > inconsistencies. > {code:java} > # create a table with some data > a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,}) > # write it to file > pq.write_table(a, 'test.parquet') > # read the same file > b = pq.read_table('test.parquet') > # a == b is True, that's good > # write table b to file > pq.write_table(b, 'test2.parquet') > # test is different from test2{code} > Basically it is: > * Create table in memory > * Write it to file > * Read it again > * Write it to a different file > The files are not the same. The second one contains extra information. > The differences are consistent across different compressions (I tried snappy > and zstd). > Also, reading the second file and writing it again produces the same > file. > Is this a bug or an expected behavior? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-11057) [Python] Data inconsistency with read and write
[ https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-11057. --- Resolution: Fixed > [Python] Data inconsistency with read and write > --- > > Key: ARROW-11057 > URL: https://issues.apache.org/jira/browse/ARROW-11057 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 >Reporter: David Quijano >Priority: Major > > I have been reading and writing some tables to parquet and I found some > inconsistencies. > {code:java} > # create a table with some data > a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,}) > # write it to file > pq.write_table(a, 'test.parquet') > # read the same file > b = pq.read_table('test.parquet') > # a == b is True, that's good > # write table b to file > pq.write_table(b, 'test2.parquet') > # test is different from test2{code} > Basically it is: > * Create table in memory > * Write it to file > * Read it again > * Write it to a different file > The files are not the same. The second one contains extra information. > The differences are consistent across different compressions (I tried snappy > and zstd). > Also, reading the second file and writing it again produces the same > file. > Is this a bug or an expected behavior? > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14254) [C++] Return a random sample of rows from a query
[ https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-14254: Labels: kernel query-engine (was: kernel) > [C++] Return a random sample of rows from a query > - > > Key: ARROW-14254 > URL: https://issues.apache.org/jira/browse/ARROW-14254 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: kernel, query-engine > Fix For: 7.0.0 > > > Please can we have a kernel that returns a random sample of rows? We've had a > request to be able to do this in R: > https://github.com/apache/arrow-cookbook/issues/83 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14254) [C++] Return a random sample of rows from a query
[ https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-14254: Fix Version/s: 7.0.0 > [C++] Return a random sample of rows from a query > - > > Key: ARROW-14254 > URL: https://issues.apache.org/jira/browse/ARROW-14254 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Priority: Major > Labels: kernel > Fix For: 7.0.0 > > > Please can we have a kernel that returns a random sample of rows? We've had a > request to be able to do this in R: > https://github.com/apache/arrow-cookbook/issues/83 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14257) [Doc][Python] dataset doc build fails
[ https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426331#comment-17426331 ] Joris Van den Bossche commented on ARROW-14257: --- But can you only run into this error in a _writing_ context? > [Doc][Python] dataset doc build fails > - > > Key: ARROW-14257 > URL: https://issues.apache.org/jira/browse/ARROW-14257 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Reporter: Antoine Pitrou >Assignee: Joris Van den Bossche >Priority: Blocker > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > {code} > >>>- > Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block > ending on line 578 > Specify :okexcept: as an option in the ipython:: block to suppress this > message > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 ds.write_dataset(scanner, new_root, format="parquet", > partitioning=new_part) > ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, > basename_template, format, partitioning, partitioning_flavor, schema, > filesystem, file_options, use_threads, max_partitions, file_visitor) > 861 _filesystemdataset_write( > 862 scanner, base_dir, basename_template, filesystem, > partitioning, > --> 863 file_options, max_partitions, file_visitor > 864 ) > ~/arrow/dev/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowNotImplementedError: Asynchronous scanning is not supported by > SyncScanner > /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367 > scanner->ScanBatchesAsync() > <<<- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8033) [Go][Integration] Enable custom_metadata integration test
[ https://issues.apache.org/jira/browse/ARROW-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-8033. -- Assignee: Matthew Topol Resolution: Fixed The implementation of the extension type added support for the metadata properly, so resolving this, as Go is no longer skipped for the custom metadata tests. > [Go][Integration] Enable custom_metadata integration test > -- > > Key: ARROW-8033 > URL: https://issues.apache.org/jira/browse/ARROW-8033 > Project: Apache Arrow > Issue Type: Improvement > Components: Go, Integration >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Matthew Topol >Priority: Major > > https://github.com/apache/arrow/pull/6556 adds an integration test including > custom metadata but Go is skipped. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14262) [C++] Document and rename is_in_meta_binary
[ https://issues.apache.org/jira/browse/ARROW-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426326#comment-17426326 ] Weston Pace commented on ARROW-14262: - Also, we will need to add them to compute.rst > [C++] Document and rename is_in_meta_binary > --- > > Key: ARROW-14262 > URL: https://issues.apache.org/jira/browse/ARROW-14262 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Priority: Major > > The is_in_meta_binary and index_in_meta_binary functions do not have any > "_doc" elements. I had simply ignored them assuming they were some kind of > specialized function that shouldn't be exposed for general consumption (see > ARROW-13949) but I recently discovered they are legitimate binary variants of > their unary counterparts. > If we want to continue to expose these functions we should rename them (meta > I assume means meta function but the python/r user has no idea what a meta > function is) and add _doc elements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14262) [C++] Document and rename is_in_meta_binary
Weston Pace created ARROW-14262: --- Summary: [C++] Document and rename is_in_meta_binary Key: ARROW-14262 URL: https://issues.apache.org/jira/browse/ARROW-14262 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Weston Pace The is_in_meta_binary and index_in_meta_binary functions do not have any "_doc" elements. I had simply ignored them assuming they were some kind of specialized function that shouldn't be exposed for general consumption (see ARROW-13949) but I recently discovered they are legitimate binary variants of their unary counterparts. If we want to continue to expose these functions we should rename them (meta I assume means meta function but the python/r user has no idea what a meta function is) and add _doc elements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13984) [Go][Parquet] Add File Package - readers
[ https://issues.apache.org/jira/browse/ARROW-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426314#comment-17426314 ] Matthew Topol commented on ARROW-13984: --- [~rick...@x14.se] as long as you are in a module (a directory that has a go.mod, or whose parent up the chain has one; you can create a module by running `go mod init ` in a directory), you should be able to run `go get github.com/apache/arrow/go/parquet` to download the parquet package. After that, you'll need to use a replace directive in order to test my branch with the reader, which you can do by running `go mod edit -replace=github.com/apache/arrow/go/parquet=github.com/zeroshade/arrow/go/parquet@goparquet-file` I believe. After that you should be able to just import it normally in a .go file by using `import "github.com/apache/arrow/go/parquet"` and so on. Let me know if you run into any issues. > [Go][Parquet] Add File Package - readers > > > Key: ARROW-13984 > URL: https://issues.apache.org/jira/browse/ARROW-13984 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 7.5h > Remaining Estimate: 0h > > Add the package for manipulating files directly, column reader/writer, file > reader/writer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13984) [Go][Parquet] Add File Package - readers
[ https://issues.apache.org/jira/browse/ARROW-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426312#comment-17426312 ] Rickard Lundin commented on ARROW-13984: This is bigger than seeing a panda being born! I wish I could figure out how to test it. Is it just a matter of cloning from git and building the whole arrow package? I will try to find the branch name. /Rickard, a newborn Golanger > [Go][Parquet] Add File Package - readers > > > Key: ARROW-13984 > URL: https://issues.apache.org/jira/browse/ARROW-13984 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 7h 20m > Remaining Estimate: 0h > > Add the package for manipulating files directly, column reader/writer, file > reader/writer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-14069) [R] By default, filter out hash functions in list_compute_functions()
[ https://issues.apache.org/jira/browse/ARROW-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-14069. - Resolution: Fixed Issue resolved by pull request 11363 [https://github.com/apache/arrow/pull/11363] > [R] By default, filter out hash functions in list_compute_functions() > - > > Key: ARROW-14069 > URL: https://issues.apache.org/jira/browse/ARROW-14069 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Critical > Labels: good-first-issue, pull-request-available > Fix For: 6.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > As users can't call hash functions directly in {{list_compute_functions()}}, > we should filter those out so they're not displayed. Perhaps via a parameter > if we still need those for our internal uses of {{list_compute_functions()}}? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions
[ https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13866. - Resolution: Fixed > [R] Implement Options for all compute kernels available via > list_compute_functions > -- > > Key: ARROW-13866 > URL: https://issues.apache.org/jira/browse/ARROW-13866 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Fix For: 6.0.0 > > > Not all of the compute kernels available via {{list_compute_functions()}} are > actually available to use in R, as they haven't been hooked up to the > relevant Options class in {{r/src/compute.cpp}}. > We should: > # Implement all remaining options classes > # Go through all the kernels listed by {{list_compute_functions()}} and > check that they have either no options classes to implement or that they have > been hooked up to the appropriate options class > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13901) [R] Implement IndexOptions
[ https://issues.apache.org/jira/browse/ARROW-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13901. - Resolution: Fixed Issue resolved by pull request 11357 [https://github.com/apache/arrow/pull/11357] > [R] Implement IndexOptions > -- > > Key: ARROW-13901 > URL: https://issues.apache.org/jira/browse/ARROW-13901 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Major > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith
[ https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-13924. - Resolution: Fixed Issue resolved by pull request 11365 [https://github.com/apache/arrow/pull/11365] > [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and > base::endsWith > > > Key: ARROW-13924 > URL: https://issues.apache.org/jira/browse/ARROW-13924 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Critical > Labels: good-first-issue, kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13893) [R] Make head/tail lazy on datasets and queries
[ https://issues.apache.org/jira/browse/ARROW-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13893: --- Labels: pull-request-available query-engine (was: query-engine) > [R] Make head/tail lazy on datasets and queries > --- > > Key: ARROW-13893 > URL: https://issues.apache.org/jira/browse/ARROW-13893 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Blocker > Labels: pull-request-available, query-engine > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API
[ https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-13797: -- Fix Version/s: 6.0.0 > [C++] Implement column projection pushdown to ORC reader in Datasets API > > > Key: ARROW-13797 > URL: https://issues.apache.org/jira/browse/ARROW-13797 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: dataset, orc, pull-request-available > Fix For: 6.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support > for ORC file format in the Datasets API, but the reader still reads all > columns regardless of the ScanOptions. Since ORC is a columnar format that > supports reading only specific fields, we can optimize this step. > The tricky part is to convert the field name of the Arrow schema to the index > in the ORC schema. Currently, this logic is included in the Python bindings > (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), > so this needs to be moved to C++. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API
[ https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-13797: - Assignee: Joris Van den Bossche > [C++] Implement column projection pushdown to ORC reader in Datasets API > > > Key: ARROW-13797 > URL: https://issues.apache.org/jira/browse/ARROW-13797 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: dataset, orc, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support > for ORC file format in the Datasets API, but the reader still reads all > columns regardless of the ScanOptions. Since ORC is a columnar format that > supports reading only specific fields, we can optimize this step. > The tricky part is to convert the field name of the Arrow schema to the index > in the ORC schema. Currently, this logic is included in the Python bindings > (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), > so this needs to be moved to C++. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API
[ https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13797: --- Labels: dataset orc pull-request-available (was: dataset orc) > [C++] Implement column projection pushdown to ORC reader in Datasets API > > > Key: ARROW-13797 > URL: https://issues.apache.org/jira/browse/ARROW-13797 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset, orc, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support > for ORC file format in the Datasets API, but the reader still reads all > columns regardless of the ScanOptions. Since ORC is a columnar format that > supports reading only specific fields, we can optimize this step. > The tricky part is to convert the field name of the Arrow schema to the index > in the ORC schema. Currently, this logic is included in the Python bindings > (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), > so this needs to be moved to C++. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion
[ https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426224#comment-17426224 ] Dragoș Moldovan-Grünfeld commented on ARROW-13887: -- There is a suggestion from [~westonpace] to still address this issue - i.e. capture the error and give some useful information to the reader. We can then create a separate issue for the schema vs col_types question in the future. FWIW I'm happy with his suggestion. Until we solve the underlying issues, a more informative message might be useful. What do you think [~npr] [~jonkeane] [~thisisnic]? > [R] Capture error produced when reading in CSV file with headers and using a > schema, and add suggestion > --- > > Key: ARROW-13887 > URL: https://issues.apache.org/jira/browse/ARROW-13887 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue > Fix For: 6.0.0 > > > When reading in a CSV with headers, and also using a schema, we get an error > as the code tries to read in the header as a line of data.
> {code:java} > share_data <- tibble::tibble( > company = c("AMZN", "GOOG", "BKNG", "TSLA"), > price = c(3463.12, 2884.38, 2300.46, 732.39) > ) > readr::write_csv(share_data, file = "share_data.csv") > share_schema <- schema( > company = utf8(), > price = float64() > ) > read_csv_arrow("share_data.csv", schema = share_schema) > {code} > {code:java} > Error: Invalid: In CSV column #1: CSV conversion error to double: invalid > value 'price' > /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, > size, quoted, &value) > /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status > /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 > parser.VisitColumn(col_index, visit) {code} > The correct thing here would have been for the user to supply the argument > {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from > the error message returned from C++. We should capture the error and instead > supply our own error message using {{rlang::abort}} which informs the user of > the error and then suggests what they can do to prevent it. > > For similar examples (and their associated PRs) see > {color:#1d1c1d}ARROW-11766, and ARROW-12791{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14252) [R] Partial matching of arguments warning
[ https://issues.apache.org/jira/browse/ARROW-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14252: --- Labels: pull-request-available (was: ) > [R] Partial matching of arguments warning > - > > Key: ARROW-14252 > URL: https://issues.apache.org/jira/browse/ARROW-14252 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > There are a few examples of partially matched arguments in the code. One > example is below, but there could be others. > {code:r} > Failure (test-dplyr-query.R:46:3): dim() on query > `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > record_batch(tbl` threw an unexpected warning. > Message: partial match of 'filtered' to 'filtered_rows' > Class: simpleWarning/warning/condition > Backtrace: > 1. arrow:::expect_dplyr_equal(...) test-dplyr-query.R:46:2 > 11. arrow::dim.arrow_dplyr_query(.) > 12. base::isTRUE(x$filtered) /Users/dragos/Documents/arrow/r/R/dplyr.R:147:2 > Failure (test-dplyr-query.R:46:3): dim() on query > `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = > Table$create(tbl` threw an unexpected warning. > Message: partial match of 'filtered' to 'filtered_rows' > Class: simpleWarning/warning/condition > Backtrace: > 1. arrow:::expect_dplyr_equal(...) test-dplyr-query.R:46:2 > 11. arrow::dim.arrow_dplyr_query(.) > 12. base::isTRUE(x$filtered) /Users/dragos/Documents/arrow/r/R/dplyr.R:147:2 > {code} > This is the relevant line of code in the example above: > https://github.com/apache/arrow/blob/25a6f591d1f162106b74e29870ebd4012e9874cc/r/R/dplyr.R#L150 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13988) [C++] Support binary-like types in hash_min_max, hash_min, hash_max
[ https://issues.apache.org/jira/browse/ARROW-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13988: Assignee: David Li > [C++] Support binary-like types in hash_min_max, hash_min, hash_max > --- > > Key: ARROW-13988 > URL: https://issues.apache.org/jira/browse/ARROW-13988 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: kernel > > An extension to ARROW-13882. Non-fixed-width types will need a separate > approach, so this was split out to a new JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13947) [C++] index kernel missing support for decimal, null, and fixed_size_binary
[ https://issues.apache.org/jira/browse/ARROW-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13947: --- Labels: kernel pull-request-available query-engine (was: kernel query-engine) > [C++] index kernel missing support for decimal, null, and fixed_size_binary > --- > > Key: ARROW-13947 > URL: https://issues.apache.org/jira/browse/ARROW-13947 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: kernel, pull-request-available, query-engine > Time Spent: 10m > Remaining Estimate: 0h > > The "index" kernel should support any equatable type. At the moment it does > not support decimal, fixed_size_binary, or null. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13947) [C++] index kernel missing support for decimal, null, and fixed_size_binary
[ https://issues.apache.org/jira/browse/ARROW-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13947: Assignee: David Li > [C++] index kernel missing support for decimal, null, and fixed_size_binary > --- > > Key: ARROW-13947 > URL: https://issues.apache.org/jira/browse/ARROW-13947 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: kernel, query-engine > > The "index" kernel should support any equatable type. At the moment it does > not support decimal, fixed_size_binary, or null. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14213) R arrow package not working on RStudio/Ubuntu
[ https://issues.apache.org/jira/browse/ARROW-14213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426206#comment-17426206 ] Neal Richardson commented on ARROW-14213: - The build output says: {code} Binary package requires libcurl and openssl If installation fails, retry after installing those system requirements {code} Can you install those and retry? > R arrow package not working on RStudio/Ubuntu > - > > Key: ARROW-14213 > URL: https://issues.apache.org/jira/browse/ARROW-14213 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: R version 3.6.3 (2020-02-29) -- "Holding the Windsock" > Copyright (C) 2020 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) >Reporter: Thomas Wutzler >Priority: Major > > I try reading feather files in R with the arrow package that were generated > in Python. > I run on R 3.6.3 on an RStudio server window on linux machine, for which I > have no other access. I get the message: > {{Cannot call io___MemoryMappedFile__Open().}} > According to the advice in the linked help-file: > [https://cran.r-project.org/web/packages/arrow/vignettes/install.html] I > create this issue with the full log of the installation: > {{}} > > arrow::install_arrow(verbose = TRUE)Installing package into > > '/Net/Groups/BGI/scratch/twutz/R/atacama-library/3.6' > (as 'lib' is unspecified)trying URL > 'https://ftp5.gwdg.de/pub/misc/cran/src/contrib/arrow_5.0.0.2.tar.gz'Content > type 'application/octet-stream' length 483642 bytes (472 > KB)==downloaded 472 KB* > installing *source* package 'arrow' ...** package 'arrow' successfully > unpacked and MD5 sums checked** using staged installationtrying URL > 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/ubuntu-16.04/arrow-5.0.0.2.zip'Content > type 'binary/octet-stream' length 17214781 bytes (16.4 > MB)==downloaded 16.4 MB*** > Successfully retrieved C++ binaries for ubuntu-16.04 > Binary package requires libcurl and 
openssl > If installation fails, retry after installing those system requirements > PKG_CFLAGS=-I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 > PKG_LIBS=-L/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/lib > -larrow_dataset -lparquet -larrow -larrow -larrow_bundled_dependencies > -larrow_dataset -lparquet -lssl -lcrypto -lcurl** libsg++ -std=gnu++11 > -I"/usr/share/R/include" -DNDEBUG > -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 -I../inst/include/-fpic -g -O2 > -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time > -D_FORTIFY_SOURCE=2 -g -c RTasks.cpp -o RTasks.o > g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG > -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 -I../inst/include/-fpic -g -O2 > -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time > -D_FORTIFY_SOURCE=2 -g -c altrep.cpp -o altrep.o > g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG > -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 -I../inst/include/-fpic -g -O2 > -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time > -D_FORTIFY_SOURCE=2 -g -c array.cpp -o array.o > g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG > -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 -I../inst/include/-fpic -g -O2 > -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time > -D_FORTIFY_SOURCE=2 -g -c array_to_vector.cpp -o array_to_vector.o > 
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG > -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 -I../inst/include/-fpic -g -O2 > -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time > -D_FORTIFY_SOURCE=2 -g -c arraydata.cpp -o arraydata.o > g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG > -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include > -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET > -DARROW_R_WITH_S3 -I../inst/includ
[jira] [Updated] (ARROW-13111) [R] altrep vectors for ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13111: --- Labels: pull-request-available (was: ) > [R] altrep vectors for ChunkedArray > --- > > Key: ARROW-13111 > URL: https://issues.apache.org/jira/browse/ARROW-13111 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Romain Francois >Assignee: Romain Francois >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13948) [C++] index_in/is_in kernels missing support for timestamp with timezone
[ https://issues.apache.org/jira/browse/ARROW-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li reassigned ARROW-13948: Assignee: David Li > [C++] index_in/is_in kernels missing support for timestamp with timezone > > > Key: ARROW-13948 > URL: https://issues.apache.org/jira/browse/ARROW-13948 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: good-first-issue, kernel, pull-request-available, > query-engine > Time Spent: 20m > Remaining Estimate: 0h > > The index_in and is_in kernels should support all equatable value types. At > the moment it supports all except for timestamp types that have a timezone. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13948) [C++] index_in/is_in kernels missing support for timestamp with timezone
[ https://issues.apache.org/jira/browse/ARROW-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13948: --- Labels: good-first-issue kernel pull-request-available query-engine (was: good-first-issue kernel query-engine) > [C++] index_in/is_in kernels missing support for timestamp with timezone > > > Key: ARROW-13948 > URL: https://issues.apache.org/jira/browse/ARROW-13948 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Priority: Major > Labels: good-first-issue, kernel, pull-request-available, > query-engine > Time Spent: 10m > Remaining Estimate: 0h > > The index_in and is_in kernels should support all equatable value types. At > the moment it supports all except for timestamp types that have a timezone. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-14196: -- Fix Version/s: 6.0.0 > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 6.0.0 > > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426167#comment-17426167 ] Jim Pivarski commented on ARROW-14196: -- If it's changing a default, I can just set the option explicitly, right? If it's changing how columns are named in the file (currently "fieldA.list.fieldB.list", etc.), then that would require adjustment on my side, but it would be an adjustment that depends on how the file was written, not the pyarrow version, right? If so, I'd have to support both naming conventions because both would exist in the wild. If there is a change that Awkward Array has to adjust to, then now may be a good time to do it because we're going to be rewriting the "from_parquet" function soon, as part of integrating with Dask https://github.com/ContinuumIO/dask-awkward adopting fsspec, and using pyarrow's Dataset API (to replace our manual implementation). If there's something that we can set to get the new behavior, we can turn that on now while developing the new version and be ready for the change. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14259) [R] converting from R vector to Array when the R vector is altrep
[ https://issues.apache.org/jira/browse/ARROW-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14259: --- Labels: pull-request-available (was: ) > [R] converting from R vector to Array when the R vector is altrep > - > > Key: ARROW-14259 > URL: https://issues.apache.org/jira/browse/ARROW-14259 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Romain Francois >Assignee: Romain Francois >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When we have an R vector that was created from an Array with altrep, and then > we want to convert again to an Array, currently it materializes it, and it > should not. Instead it should be grabbing the array from the internals of > the altrep object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14257) [Doc][Python] dataset doc build fails
[ https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426081#comment-17426081 ] Weston Pace commented on ARROW-14257: - In Python it is always use_async=True. In R the scanner is hidden from the user on dataset writes but the option there is use_async as well. In C++ the option is UseAsync in the ScannerBuilder. How about, "Writing datasets requires that the input scanner is configured to scan asynchronously via the use_async or UseAsync options." > [Doc][Python] dataset doc build fails > - > > Key: ARROW-14257 > URL: https://issues.apache.org/jira/browse/ARROW-14257 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Reporter: Antoine Pitrou >Assignee: Joris Van den Bossche >Priority: Blocker > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > {code} > >>>- > Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block > ending on line 578 > Specify :okexcept: as an option in the ipython:: block to suppress this > message > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 ds.write_dataset(scanner, new_root, format="parquet", > partitioning=new_part) > ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, > basename_template, format, partitioning, partitioning_flavor, schema, > filesystem, file_options, use_threads, max_partitions, file_visitor) > 861 _filesystemdataset_write( > 862 scanner, base_dir, basename_template, filesystem, > partitioning, > --> 863 file_options, max_partitions, file_visitor > 864 ) > ~/arrow/dev/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowNotImplementedError: Asynchronous scanning is not supported by > SyncScanner > /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367 > scanner->ScanBatchesAsync() > <<<- > 
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith
[ https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-13924: --- Labels: good-first-issue kernel pull-request-available (was: good-first-issue kernel) > [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and > base::endsWith > > > Key: ARROW-13924 > URL: https://issues.apache.org/jira/browse/ARROW-13924 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Critical > Labels: good-first-issue, kernel, pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14257) [Doc][Python] dataset doc build fails
[ https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426037#comment-17426037 ] Joris Van den Bossche commented on ARROW-14257: --- The error message is indeed not very helpful. We could mention something about using the {{use_async}} scan option, although I am not fully sure it would be applicable in all cases where you can run into this error? > [Doc][Python] dataset doc build fails > - > > Key: ARROW-14257 > URL: https://issues.apache.org/jira/browse/ARROW-14257 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Reporter: Antoine Pitrou >Assignee: Joris Van den Bossche >Priority: Blocker > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {code} > >>>- > Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block > ending on line 578 > Specify :okexcept: as an option in the ipython:: block to suppress this > message > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 ds.write_dataset(scanner, new_root, format="parquet", > partitioning=new_part) > ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, > basename_template, format, partitioning, partitioning_flavor, schema, > filesystem, file_options, use_threads, max_partitions, file_visitor) > 861 _filesystemdataset_write( > 862 scanner, base_dir, basename_template, filesystem, > partitioning, > --> 863 file_options, max_partitions, file_visitor > 864 ) > ~/arrow/dev/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowNotImplementedError: Asynchronous scanning is not supported by > SyncScanner > /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367 > scanner->ScanBatchesAsync() > <<<- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14257) [Doc][Python] dataset doc build fails
[ https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14257: --- Labels: pull-request-available (was: ) > [Doc][Python] dataset doc build fails > - > > Key: ARROW-14257 > URL: https://issues.apache.org/jira/browse/ARROW-14257 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Reporter: Antoine Pitrou >Assignee: Joris Van den Bossche >Priority: Blocker > Labels: pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {code} > >>>- > Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block > ending on line 578 > Specify :okexcept: as an option in the ipython:: block to suppress this > message > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 ds.write_dataset(scanner, new_root, format="parquet", > partitioning=new_part) > ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, > basename_template, format, partitioning, partitioning_flavor, schema, > filesystem, file_options, use_threads, max_partitions, file_visitor) > 861 _filesystemdataset_write( > 862 scanner, base_dir, basename_template, filesystem, > partitioning, > --> 863 file_options, max_partitions, file_visitor > 864 ) > ~/arrow/dev/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowNotImplementedError: Asynchronous scanning is not supported by > SyncScanner > /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367 > scanner->ScanBatchesAsync() > <<<- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-14257) [Doc][Python] dataset doc build fails
[ https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-14257: - Assignee: Joris Van den Bossche > [Doc][Python] dataset doc build fails > - > > Key: ARROW-14257 > URL: https://issues.apache.org/jira/browse/ARROW-14257 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Python >Reporter: Antoine Pitrou >Assignee: Joris Van den Bossche >Priority: Blocker > Fix For: 6.0.0 > > > {code} > >>>- > Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block > ending on line 578 > Specify :okexcept: as an option in the ipython:: block to suppress this > message > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 ds.write_dataset(scanner, new_root, format="parquet", > partitioning=new_part) > ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, > basename_template, format, partitioning, partitioning_flavor, schema, > filesystem, file_options, use_threads, max_partitions, file_visitor) > 861 _filesystemdataset_write( > 862 scanner, base_dir, basename_template, filesystem, > partitioning, > --> 863 file_options, max_partitions, file_visitor > 864 ) > ~/arrow/dev/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowNotImplementedError: Asynchronous scanning is not supported by > SyncScanner > /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367 > scanner->ScanBatchesAsync() > <<<- > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith
[ https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane reassigned ARROW-13924: Assignee: Nicola Crane > [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and > base::endsWith > > > Key: ARROW-13924 > URL: https://issues.apache.org/jira/browse/ARROW-13924 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Critical > Labels: good-first-issue, kernel > Fix For: 6.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-14069) [R] By default, filter out hash functions in list_compute_functions()
[ https://issues.apache.org/jira/browse/ARROW-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14069: --- Labels: good-first-issue pull-request-available (was: good-first-issue) > [R] By default, filter out hash functions in list_compute_functions() > - > > Key: ARROW-14069 > URL: https://issues.apache.org/jira/browse/ARROW-14069 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Assignee: Nicola Crane >Priority: Critical > Labels: good-first-issue, pull-request-available > Fix For: 6.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > As users can't call hash functions directly in {{list_compute_functions()}}, > we should filter those out so they're not displayed. Perhaps via a parameter > if we still need those for our internal uses of {{list_compute_functions()}}? -- This message was sent by Atlassian Jira (v8.3.4#803005)