[jira] [Resolved] (ARROW-14261) [C++] Includes should be in alphabetical order

2021-10-08 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-14261.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11362
[https://github.com/apache/arrow/pull/11362]

> [C++] Includes should be in alphabetical order
> --
>
> Key: ARROW-14261
> URL: https://issues.apache.org/jira/browse/ARROW-14261
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benson Muite
>Assignee: Benson Muite
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Includes in
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/registry_internal.h]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/registry.cc
> should be in alphabetical order



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14222) [C++] Create GcsFileSystem skeleton

2021-10-08 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-14222.
--
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11331
[https://github.com/apache/arrow/pull/11331]

> [C++] Create GcsFileSystem skeleton
> ---
>
> Key: ARROW-14222
> URL: https://issues.apache.org/jira/browse/ARROW-14222
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Implement a skeleton for GCSFileSystem. All functions would return
> `Status::NotImplemented()`. This will keep the future changes smaller and
> allow me to verify, with a smaller PR, that all CI builds are working.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14267) [Python] Cannot convert pd.DataFrame with geometry cells to pa.Table

2021-10-08 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-14267:
-
Summary: [Python] Cannot convert pd.DataFrame with geometry cells to 
pa.Table  (was: Cannot convert pd.DataFrame with geometry cells to pa.Table)

> [Python] Cannot convert pd.DataFrame with geometry cells to pa.Table
> 
>
> Key: ARROW-14267
> URL: https://issues.apache.org/jira/browse/ARROW-14267
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Henrikh Kantuni
>Priority: Minor
>  Labels: pyarrow
>
> Example: 
> {code:java}
> import geopandas as gpd
> import pandas as pd
> import pyarrow as pa
> path = gpd.datasets.get_path("naturalearth_lowres")
> data = gpd.read_file(path)
> df = pd.DataFrame(data)
> table = pa.Table.from_pandas(df)
> print(table)
> {code}
> Throws the following error:
> {code:java}
> Traceback (most recent call last):
>  File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
>  table = pa.Table.from_pandas(df)
>  File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in dataframe_to_arrays
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in <listcomp>
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 581, in convert_column
>  raise e
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 575, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
>  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
> failed for column geometry with type geometry'){code}
>  
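pyarrow has no native geometry type, so the usual workaround is to serialize the geometry column before conversion. A minimal sketch of that idea, assuming the column holds shapely objects exposing the standard {{wkb}} attribute:

{code:java}
import geopandas as gpd
import pandas as pd
import pyarrow as pa

path = gpd.datasets.get_path("naturalearth_lowres")
data = gpd.read_file(path)
df = pd.DataFrame(data)

# Encode each shapely geometry as WKB bytes so pyarrow can store the
# column as binary instead of failing on the unknown "geometry" dtype.
df["geometry"] = df["geometry"].apply(lambda geom: geom.wkb)

table = pa.Table.from_pandas(df)
print(table)
{code}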



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14267) Cannot convert pd.DataFrame with geometry cells to pa.Table

2021-10-08 Thread Henrikh Kantuni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrikh Kantuni updated ARROW-14267:

Summary: Cannot convert pd.DataFrame with geometry cells to pa.Table  (was: 
Cannot convert DataFrame with geometry cells to Table)

> Cannot convert pd.DataFrame with geometry cells to pa.Table
> ---
>
> Key: ARROW-14267
> URL: https://issues.apache.org/jira/browse/ARROW-14267
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Henrikh Kantuni
>Priority: Minor
>  Labels: pyarrow
>
> Example: 
> {code:java}
> import geopandas as gpd
> import pandas as pd
> import pyarrow as pa
> path = gpd.datasets.get_path("naturalearth_lowres")
> data = gpd.read_file(path)
> df = pd.DataFrame(data)
> table = pa.Table.from_pandas(df)
> print(table)
> {code}
> Throws the following error:
> {code:java}
> Traceback (most recent call last):
>  File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
>  table = pa.Table.from_pandas(df)
>  File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in dataframe_to_arrays
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in <listcomp>
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 581, in convert_column
>  raise e
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 575, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
>  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
> failed for column geometry with type geometry'){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14268) Cannot convert DataFrame with complex128 cells to Table

2021-10-08 Thread Henrikh Kantuni (Jira)
Henrikh Kantuni created ARROW-14268:
---

 Summary: Cannot convert DataFrame with complex128 cells to Table
 Key: ARROW-14268
 URL: https://issues.apache.org/jira/browse/ARROW-14268
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 5.0.0
Reporter: Henrikh Kantuni


Example:

 
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa

data = np.array([1 + 1j])
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
print(table)
{code}
Throws the following error:
{code:java}
Traceback (most recent call last):
 File "/Users/Henrikh/Desktop/tmp.py", line 7, in 
 table = pa.Table.from_pandas(df)
 File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in dataframe_to_arrays
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in <listcomp>
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
581, in convert_column
 raise e
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
575, in convert_column
 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
 File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
 File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
 File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('Unsupported numpy type 15', 'Conversion 
failed for column 0 with type complex128'){code}
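Arrow has no complex number type, so a practical workaround is to split the array into its real and imaginary parts before conversion; a minimal sketch:

{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa

data = np.array([1 + 1j])

# Arrow cannot store complex128 directly; keep the real and imaginary
# components as two float64 columns instead.
df = pd.DataFrame({"real": data.real, "imag": data.imag})

table = pa.Table.from_pandas(df)
print(table)
{code}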
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14268) Cannot convert pd.DataFrame with complex128 cells to pa.Table

2021-10-08 Thread Henrikh Kantuni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrikh Kantuni updated ARROW-14268:

Summary: Cannot convert pd.DataFrame with complex128 cells to pa.Table  
(was: Cannot convert DataFrame with complex128 cells to Table)

> Cannot convert pd.DataFrame with complex128 cells to pa.Table
> -
>
> Key: ARROW-14268
> URL: https://issues.apache.org/jira/browse/ARROW-14268
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Henrikh Kantuni
>Priority: Minor
>  Labels: pyarrow
>
> Example:
>  
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> data = np.array([1 + 1j])
> df = pd.DataFrame(data)
> table = pa.Table.from_pandas(df)
> print(table)
> {code}
> Throws the following error:
> {code:java}
> Traceback (most recent call last):
>  File "/Users/Henrikh/Desktop/tmp.py", line 7, in 
>  table = pa.Table.from_pandas(df)
>  File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in dataframe_to_arrays
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in <listcomp>
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 581, in convert_column
>  raise e
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 575, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
>  File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: ('Unsupported numpy type 15', 
> 'Conversion failed for column 0 with type complex128'){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14267) Cannot convert DataFrame with geometry cells to Table

2021-10-08 Thread Henrikh Kantuni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrikh Kantuni updated ARROW-14267:

Summary: Cannot convert DataFrame with geometry cells to Table  (was: 
Cannot convert DataFrame with geometry `numpy.dtype` cells to Table)

> Cannot convert DataFrame with geometry cells to Table
> -
>
> Key: ARROW-14267
> URL: https://issues.apache.org/jira/browse/ARROW-14267
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Henrikh Kantuni
>Priority: Minor
>  Labels: pyarrow
>
> Example: 
> {code:java}
> import geopandas as gpd
> import pandas as pd
> import pyarrow as pa
> path = gpd.datasets.get_path("naturalearth_lowres")
> data = gpd.read_file(path)
> df = pd.DataFrame(data)
> table = pa.Table.from_pandas(df)
> print(table)
> {code}
> Throws the following error:
> {code:java}
> Traceback (most recent call last):
>  File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
>  table = pa.Table.from_pandas(df)
>  File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in dataframe_to_arrays
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in <listcomp>
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 581, in convert_column
>  raise e
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 575, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
>  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
> failed for column geometry with type geometry'){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14267) Cannot convert DataFrame with geometry `numpy.dtype` cells to Table

2021-10-08 Thread Henrikh Kantuni (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrikh Kantuni updated ARROW-14267:

Description: 
Example: 
{code:java}
import geopandas as gpd
import pandas as pd
import pyarrow as pa


path = gpd.datasets.get_path("naturalearth_lowres")
data = gpd.read_file(path)
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
print(table)
{code}
Throws the following error:
{code:java}
Traceback (most recent call last):
 File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
 table = pa.Table.from_pandas(df)
 File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in dataframe_to_arrays
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in <listcomp>
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
581, in convert_column
 raise e
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
575, in convert_column
 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
 File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
 File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
 File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
failed for column geometry with type geometry'){code}
 

  was:
Example:

 
{code:java}
import geopandas as gpd
import pandas as pd
import pyarrow as pa
path = gpd.datasets.get_path("naturalearth_lowres")
data = gpd.read_file(path)
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
print(table)
{code}
Throws the following error:
{code:java}
Traceback (most recent call last):
 File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
 table = pa.Table.from_pandas(df)
 File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in dataframe_to_arrays
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in <listcomp>
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
581, in convert_column
 raise e
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
575, in convert_column
 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
 File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
 File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
 File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
failed for column geometry with type geometry'){code}
 


> Cannot convert DataFrame with geometry `numpy.dtype` cells to Table
> ---
>
> Key: ARROW-14267
> URL: https://issues.apache.org/jira/browse/ARROW-14267
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Henrikh Kantuni
>Priority: Minor
>  Labels: pyarrow
>
> Example: 
> {code:java}
> import geopandas as gpd
> import pandas as pd
> import pyarrow as pa
> path = gpd.datasets.get_path("naturalearth_lowres")
> data = gpd.read_file(path)
> df = pd.DataFrame(data)
> table = pa.Table.from_pandas(df)
> print(table)
> {code}
> Throws the following error:
> {code:java}
> Traceback (most recent call last):
>  File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
>  table = pa.Table.from_pandas(df)
>  File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in dataframe_to_arrays
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 594, in <listcomp>
>  arrays = [convert_column(c, f)
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 581, in convert_column
>  raise e
>  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
> 575, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
>  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
> failed for column geometry with type geometry'){code}
>  



--

[jira] [Created] (ARROW-14267) Cannot convert DataFrame with geometry `numpy.dtype` cells to Table

2021-10-08 Thread Henrikh Kantuni (Jira)
Henrikh Kantuni created ARROW-14267:
---

 Summary: Cannot convert DataFrame with geometry `numpy.dtype` 
cells to Table
 Key: ARROW-14267
 URL: https://issues.apache.org/jira/browse/ARROW-14267
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 5.0.0
Reporter: Henrikh Kantuni


Example:

 
{code:java}
import geopandas as gpd
import pandas as pd
import pyarrow as pa
path = gpd.datasets.get_path("naturalearth_lowres")
data = gpd.read_file(path)
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
print(table)
{code}
Throws the following error:
{code:java}
Traceback (most recent call last):
 File "/Users/Henrikh/Desktop/tmp.py", line 8, in 
 table = pa.Table.from_pandas(df)
 File "pyarrow/table.pxi", line 1553, in pyarrow.lib.Table.from_pandas
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in dataframe_to_arrays
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
594, in <listcomp>
 arrays = [convert_column(c, f)
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
581, in convert_column
 raise e
 File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 
575, in convert_column
 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
 File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 79, in pyarrow.lib._ndarray_to_array
 File "pyarrow/array.pxi", line 67, in pyarrow.lib._ndarray_to_type
 File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion 
failed for column geometry with type geometry'){code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14266) [R] Use WriteNode to write queries

2021-10-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-14266:
---

 Summary: [R] Use WriteNode to write queries
 Key: ARROW-14266
 URL: https://issues.apache.org/jira/browse/ARROW-14266
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 7.0.0


Following ARROW-13542. Any query that has a join or an aggregation currently 
has to first evaluate the query and hold it in memory before creating a Scanner 
to write it. We could improve that by using a WriteNode inside write_dataset() 
(and maybe that improves the other cases too, or at least allows us to delete 
some code). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14265) [C++][Gandiva] Add support for LLVM 13

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14265:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Add support for LLVM 13
> --
>
> Key: ARROW-14265
> URL: https://issues.apache.org/jira/browse/ARROW-14265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14265) [C++][Gandiva] Add support for LLVM 13

2021-10-08 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-14265:
-
Summary: [C++][Gandiva] Add support for LLVM 13  (was: [C++][Gandiva] 
Support building with LLVM 13)

> [C++][Gandiva] Add support for LLVM 13
> --
>
> Key: ARROW-14265
> URL: https://issues.apache.org/jira/browse/ARROW-14265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14265) [C++][Gandiva] Support building with LLVM 13

2021-10-08 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-14265:


 Summary: [C++][Gandiva] Support building with LLVM 13
 Key: ARROW-14265
 URL: https://issues.apache.org/jira/browse/ARROW-14265
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14264) [R] Support inequality joins

2021-10-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-14264:
---

 Summary: [R] Support inequality joins
 Key: ARROW-14264
 URL: https://issues.apache.org/jira/browse/ARROW-14264
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 7.0.0


We'll need this not-yet-merged dplyr API to do it: 
https://github.com/tidyverse/dplyr/pull/5910



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14264) [R] Support inequality joins

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14264:

Labels: query-engine  (was: )

> [R] Support inequality joins
> 
>
> Key: ARROW-14264
> URL: https://issues.apache.org/jira/browse/ARROW-14264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 7.0.0
>
>
> We'll need this not-yet-merged dplyr API to do it: 
> https://github.com/tidyverse/dplyr/pull/5910



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Environment: 
pyarrow==5.0.0
C++ = 5.0.0
Windows 10 Pro x64
Python 3.8.5


  was:
pyarrow==5.0.0
C++ = 5.0.0
Windows 10 Pro x64




> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
> Python 3.8.5
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   (*writer)->Close();
> }
> auto buffer = (*stream)->Finish();
> std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()
> reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid 
> flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?
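Two things may be worth checking here, as hedged guesses rather than a diagnosis. First, the {{std::ofstream}} above is opened without {{std::ios::binary}}; on Windows that applies newline translation to the bytes, which can corrupt an IPC stream. Second, on the Python side the stream can be opened with {{pa.ipc.open_stream}}; a minimal sketch:

{code:java}
import pyarrow as pa

with open("record-batch-large.arrow", "rb") as f:
    data = f.read()

# Wrap the raw bytes in a BufferReader and open them as an IPC stream.
reader = pa.ipc.open_stream(pa.BufferReader(data))
batches = list(reader)  # read every RecordBatch in the stream
print(batches[0].schema)
{code}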



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); 
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}

auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
    data = f.read()

reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid flatbuffers 
message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); 
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}

auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   sdk::MaybeThrowError(writer);   
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());
> }
> auto buffer = (*stream)->Finish();std::ofstream 
> ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()
> reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid 
> flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  (*writer)->Close();
}

auto buffer = (*stream)->Finish();

std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
    data = f.read()

reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid flatbuffers 
message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}

auto buffer = (*stream)->Finish();

std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
    data = f.read()

reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid flatbuffers 
message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   (*writer)->Close();
> }
> auto buffer = (*stream)->Finish();
> std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()
> reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid 
> flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}

auto buffer = (*stream)->Finish();

std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
    data = f.read()

reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid flatbuffers 
message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); 
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}

auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
    data = f.read()

reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid flatbuffers 
message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());
> }
> auto buffer = (*stream)->Finish();
> std::ofstream ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()
> reader = pa.RecordBatchStreamReader(data) # throws here - "Invalid 
> flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();

{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); 
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}

auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();

auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); 
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}
auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   sdk::MaybeThrowError(writer);   
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());
> }
> auto buffer = (*stream)->Finish();std::ofstream 
> ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();
> auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
> - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); 
  const auto writeRes = (*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close());
}
auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); const auto writeRes = 
(*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close()); sdk::MaybeThrowError(writeRes);
}
auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   sdk::MaybeThrowError(writer);   
>   const auto writeRes = (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());
> }
> auto buffer = (*stream)->Finish();std::ofstream 
> ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
> - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted), but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); const auto writeRes = 
(*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close()); sdk::MaybeThrowError(writeRes);
}
auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted), but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
sdk::MaybeThrowError(writer);   const auto writeRes = 
(*writer)->WriteRecordBatch(batch);
sdk::MaybeThrowError((*writer)->Close());   
sdk::MaybeThrowError(writeRes);
}auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted), but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   sdk::MaybeThrowError(writer);   const auto writeRes = 
> (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());   sdk::MaybeThrowError(writeRes);
> }
> auto buffer = (*stream)->Finish();std::ofstream 
> ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
> - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Ashby updated ARROW-14263:

Description: 
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted, but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); const auto writeRes = 
(*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close()); sdk::MaybeThrowError(writeRes);
}
auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?

  was:
I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted), but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
  const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
  sdk::MaybeThrowError(writer); const auto writeRes = 
(*writer)->WriteRecordBatch(batch);
  sdk::MaybeThrowError((*writer)->Close()); sdk::MaybeThrowError(writeRes);
}
auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?


> "Invalid flatbuffers message" thrown with some serialized RecordBatch's
> ---
>
> Key: ARROW-14263
> URL: https://issues.apache.org/jira/browse/ARROW-14263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 5.0.0
> Environment: pyarrow==5.0.0
> C++ = 5.0.0
> Windows 10 Pro x64
>Reporter: Bryan Ashby
>Priority: Major
> Attachments: record-batch-large.arrow
>
>
> I'm running into various exceptions (often: "Invalid flatbuffers message") 
> when attempting to de-serialize RecordBatch's in Python that were generated 
> in C++.
> The same batch can be de-serialized back within C++.
> *Example (C++)* (status checks omitted, but they are checked in real code)*:*
> {code:java}
> const auto stream = arrow::io::BufferOutputStream::Create();
> {
>   const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
>   sdk::MaybeThrowError(writer);   const auto writeRes = 
> (*writer)->WriteRecordBatch(batch);
>   sdk::MaybeThrowError((*writer)->Close());   sdk::MaybeThrowError(writeRes);
> }
> auto buffer = (*stream)->Finish();std::ofstream 
> ofs("record-batch-large.arrow"); // we'll read this in Python
> ofs.write(reinterpret_cast<const char*>((*buffer)->data()), 
> (*buffer)->size());
> ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
> {code}
> *Then in Python*:
> {code:java}
> with open("record-batch-large.arrow", "rb") as f:
>   data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
> - "Invalid flatbuffers message"
> {code}
> Please see the attached .arrow file (produced above).
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14263) "Invalid flatbuffers message" thrown with some serialized RecordBatch's

2021-10-08 Thread Bryan Ashby (Jira)
Bryan Ashby created ARROW-14263:
---

 Summary: "Invalid flatbuffers message" thrown with some serialized 
RecordBatch's
 Key: ARROW-14263
 URL: https://issues.apache.org/jira/browse/ARROW-14263
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 5.0.0
 Environment: pyarrow==5.0.0
C++ = 5.0.0
Windows 10 Pro x64


Reporter: Bryan Ashby
 Attachments: record-batch-large.arrow

I'm running into various exceptions (often: "Invalid flatbuffers message") when 
attempting to de-serialize RecordBatch's in Python that were generated in C++.

The same batch can be de-serialized back within C++.

*Example (C++)* (status checks omitted), but they are checked in real code)*:*
{code:java}
const auto stream = arrow::io::BufferOutputStream::Create();
{
const auto writer = arrow::ipc::MakeStreamWriter(*stream, schema);
sdk::MaybeThrowError(writer);   const auto writeRes = 
(*writer)->WriteRecordBatch(batch);
sdk::MaybeThrowError((*writer)->Close());   
sdk::MaybeThrowError(writeRes);
}auto buffer = (*stream)->Finish();std::ofstream 
ofs("record-batch-large.arrow"); // we'll read this in Python
ofs.write(reinterpret_cast<const char*>((*buffer)->data()), (*buffer)->size());
ofs.close();auto backAgain = DeserializeRecordBatch((*buffer)); // all good
{code}
*Then in Python*:
{code:java}
with open("record-batch-large.arrow", "rb") as f:
data = f.read()reader = pa.RecordBatchStreamReader(data) # throws here 
- "Invalid flatbuffers message"
{code}
Please see the attached .arrow file (produced above).

Any ideas?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7878) [C++] Implement LogicalPlan and LogicalPlanBuilder

2021-10-08 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman closed ARROW-7878.
---
Resolution: Won't Fix

> [C++] Implement LogicalPlan and LogicalPlanBuilder
> --
>
> Key: ARROW-7878
> URL: https://issues.apache.org/jira/browse/ARROW-7878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13739) [R] Support dplyr::count() and tally()

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13739.
-
Resolution: Fixed

Issue resolved by pull request 11306
[https://github.com/apache/arrow/pull/11306]

> [R] Support dplyr::count() and tally()
> --
>
> Key: ARROW-13739
> URL: https://issues.apache.org/jira/browse/ARROW-13739
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> These may just work by borrowing the data.frame methods in dplyr



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-4630) [C++] Implement serial version of join

2021-10-08 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman closed ARROW-4630.
---
Resolution: Won't Fix

> [C++] Implement serial version of join
> --
>
> Key: ARROW-4630
> URL: https://issues.apache.org/jira/browse/ARROW-4630
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.12.0
>Reporter: Areg Melik-Adamyan
>Assignee: Areg Melik-Adamyan
>Priority: Major
>
> Implement the serial version of join operator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8765) [C++] Design Scheduler API

2021-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426346#comment-17426346
 ] 

Neal Richardson commented on ARROW-8765:


[~apitrou] this seems handled by the ExecPlan and related work; can we close 
this, or is there more to do?

> [C++] Design Scheduler API
> --
>
> Key: ARROW-8765
> URL: https://issues.apache.org/jira/browse/ARROW-8765
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13078) [R] Bindings for str_replace_na()

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-13078:
---

Assignee: (was: Ian Cook)

> [R] Bindings for str_replace_na()
> -
>
> Key: ARROW-13078
> URL: https://issues.apache.org/jira/browse/ARROW-13078
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: good-second-issue
> Fix For: 6.0.0
>
>
> Implement the stringr function {{str_replace_na()}} which is useful in 
> combination with {{str_c()}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12137:

Fix Version/s: (was: 6.0.0)
   7.0.0

> [R] New/improved vignette on dplyr features
> ---
>
> Key: ARROW-12137
> URL: https://issues.apache.org/jira/browse/ARROW-12137
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 7.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12137) [R] New/improved vignette on dplyr features

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-12137:
---

Assignee: (was: Ian Cook)

> [R] New/improved vignette on dplyr features
> ---
>
> Key: ARROW-12137
> URL: https://issues.apache.org/jira/browse/ARROW-12137
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 6.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12763) [R] Optimize dplyr queries that use head/tail after arrange

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-12763:
---

Assignee: Neal Richardson

> [R] Optimize dplyr queries that use head/tail after arrange
> ---
>
> Key: ARROW-12763
> URL: https://issues.apache.org/jira/browse/ARROW-12763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Use the Arrow C++ function {{partition_nth_indices}} to optimize dplyr 
> queries like this:
> {code:r}
> iris %>%
>   Table$create() %>% 
>   arrange(desc(Sepal.Length)) %>%
>   head(10) %>%
>   collect()
> {code}
> This query sorts the full table even though it doesn't need to. It could use 
> {{partition_nth_indices}} to find the rows containing the top 10 values of 
> {{Sepal.Length}} and only collect and sort those 10 rows.
> Test to see if this improves performance in practice on larger data.
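
For illustration, a rough Python sketch of the kernel this would lean on (hedged; selecting the smallest values here, the descending case is symmetric):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([5.1, 7.2, 6.3, 4.9, 6.8, 5.5])
# After partitioning, the first `pivot` indices reference the 3 smallest
# values (in no particular order); only those rows then need a full sort.
idx = pc.partition_nth_indices(arr, pivot=3)
smallest = arr.take(idx[:3])
{code}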



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write

2021-10-08 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426339#comment-17426339
 ] 

Joris Van den Bossche commented on ARROW-11057:
---

This logic was changed in PARQUET-1798 / 
https://github.com/apache/arrow/pull/10289; those PARQUET:field_id fields are 
now only preserved if already present, and are not automatically generated. 

If you re-run the example above with a recent pyarrow release, you actually get 
identical files, and the schemas no longer contain the field_ids either.
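
To verify, a sketch of the round trip under a recent pyarrow:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

a = pa.Table.from_pydict({'x': [1] * 100})
pq.write_table(a, 'test.parquet')
b = pq.read_table('test.parquet')
pq.write_table(b, 'test2.parquet')

# With the PARQUET-1798 change the two files should now be byte-identical
with open('test.parquet', 'rb') as f1, open('test2.parquet', 'rb') as f2:
    print(f1.read() == f2.read())
{code}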

> [Python] Data inconsistency with read and write
> ---
>
> Key: ARROW-11057
> URL: https://issues.apache.org/jira/browse/ARROW-11057
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: David Quijano
>Priority: Major
>
> I have been reading and writing some tables to parquet and I found some 
> inconsistencies.
> {code:java}
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test is different from test2{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy 
> and zstd).
> Also, reading the second file and and writing it again, produces the same 
> file.
> Is this a bug or an expected behavior?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11057) [Python] Data inconsistency with read and write

2021-10-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-11057.
---
Resolution: Fixed

> [Python] Data inconsistency with read and write
> ---
>
> Key: ARROW-11057
> URL: https://issues.apache.org/jira/browse/ARROW-11057
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: David Quijano
>Priority: Major
>
> I have been reading and writing some tables to parquet and I found some 
> inconsistencies.
> {code:java}
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test is different from test2{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy 
> and zstd).
> Also, reading the second file and and writing it again, produces the same 
> file.
> Is this a bug or an expected behavior?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14254) [C++] Return a random sample of rows from a query

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14254:

Labels: kernel query-engine  (was: kernel)

> [C++] Return a random sample of rows from a query
> -
>
> Key: ARROW-14254
> URL: https://issues.apache.org/jira/browse/ARROW-14254
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Priority: Major
>  Labels: kernel, query-engine
> Fix For: 7.0.0
>
>
> Please can we have a kernel that returns a random sample of rows? We've had a 
> request to be able to do this in R: 
> https://github.com/apache/arrow-cookbook/issues/83
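
Until a dedicated kernel exists, a workaround is to draw indices outside Arrow and {{take}} them (a hedged Python sketch; the R side would be analogous):
{code:python}
import numpy as np
import pyarrow as pa

table = pa.table({"x": list(range(100))})
rng = np.random.default_rng()
idx = rng.choice(len(table), size=10, replace=False)  # sample rows without replacement
sample = table.take(idx)
{code}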



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14254) [C++] Return a random sample of rows from a query

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14254:

Fix Version/s: 7.0.0

> [C++] Return a random sample of rows from a query
> -
>
> Key: ARROW-14254
> URL: https://issues.apache.org/jira/browse/ARROW-14254
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nicola Crane
>Priority: Major
>  Labels: kernel
> Fix For: 7.0.0
>
>
> Please can we have a kernel that returns a random sample of rows? We've had a 
> request to be able to do this in R: 
> https://github.com/apache/arrow-cookbook/issues/83



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14257) [Doc][Python] dataset doc build fails

2021-10-08 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426331#comment-17426331
 ] 

Joris Van den Bossche commented on ARROW-14257:
---

But can you only run into this error in a _writing_ context? 

> [Doc][Python] dataset doc build fails
> -
>
> Key: ARROW-14257
> URL: https://issues.apache.org/jira/browse/ARROW-14257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code}
> >>>-
> Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block 
> ending on line 578
> Specify :okexcept: as an option in the ipython:: block to suppress this 
> message
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 ds.write_dataset(scanner, new_root, format="parquet", 
> partitioning=new_part)
> ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
> basename_template, format, partitioning, partitioning_flavor, schema, 
> filesystem, file_options, use_threads, max_partitions, file_visitor)
> 861 _filesystemdataset_write(
> 862 scanner, base_dir, basename_template, filesystem, 
> partitioning,
> --> 863 file_options, max_partitions, file_visitor
> 864 )
> ~/arrow/dev/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Asynchronous scanning is not supported by 
> SyncScanner
> /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367  
> scanner->ScanBatchesAsync()
> <<<-
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8033) [Go][Integration] Enable custom_metadata integration test

2021-10-08 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-8033.
--
  Assignee: Matthew Topol
Resolution: Fixed

The implementation of the extension type added proper support for the metadata, 
so resolving this, as Go is no longer skipped for the custom metadata tests.

> [Go][Integration] Enable custom_metadata integration test
> --
>
> Key: ARROW-8033
> URL: https://issues.apache.org/jira/browse/ARROW-8033
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go, Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Matthew Topol
>Priority: Major
>
> https://github.com/apache/arrow/pull/6556 adds an integration test including 
> custom metadata but Go is skipped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14262) [C++] Document and rename is_in_meta_binary

2021-10-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426326#comment-17426326
 ] 

Weston Pace commented on ARROW-14262:
-

Also, we will need to add them to compute.rst

> [C++] Document and rename is_in_meta_binary
> ---
>
> Key: ARROW-14262
> URL: https://issues.apache.org/jira/browse/ARROW-14262
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> The is_in_meta_binary and index_in_meta_binary functions do not have any 
> "_doc" elements.  I had simply ignored them assuming they were some kind of 
> specialized function that shouldn't be exposed for general consumption (see 
> ARROW-13949) but I recently discovered they are legitimate binary variants of 
> their unary counterparts.
> If we want to continue to expose these functions, we should rename them 
> ("meta" presumably means meta function, but the Python/R user has no idea 
> what a meta function is) and add _doc elements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14262) [C++] Document and rename is_in_meta_binary

2021-10-08 Thread Weston Pace (Jira)
Weston Pace created ARROW-14262:
---

 Summary: [C++] Document and rename is_in_meta_binary
 Key: ARROW-14262
 URL: https://issues.apache.org/jira/browse/ARROW-14262
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


The is_in_meta_binary and index_in_meta_binary functions do not have any "_doc" 
elements.  I had simply ignored them assuming they were some kind of 
specialized function that shouldn't be exposed for general consumption (see 
ARROW-13949) but I recently discovered they are legitimate binary variants of 
their unary counterparts.

If we want to continue to expose these functions, we should rename them ("meta" 
presumably means meta function, but the Python/R user has no idea what a meta 
function is) and add _doc elements.
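
For context, a sketch of the two calling conventions (a hedged illustration based on the description above, not documented API):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3])
vals = pa.array([2, 3])

# Documented unary form: the value set travels in SetLookupOptions
pc.is_in(arr, value_set=vals)

# Binary variant: the value set is passed as a second argument
pc.call_function("is_in_meta_binary", [arr, vals])
{code}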



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13984) [Go][Parquet] Add File Package - readers

2021-10-08 Thread Matthew Topol (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426314#comment-17426314
 ] 

Matthew Topol commented on ARROW-13984:
---

[~rick...@x14.se] as long as you are in a module (a directory that has a go.mod, 
or whose parent up the chain has one; you can create a module by running 
`go mod init <module-path>` in a directory), you should be able to run `go get 
github.com/apache/arrow/go/parquet` to download the parquet package.

After that, you'll need to use a replace directive in order to test my branch 
with the reader, which you can do by running `go mod edit 
-replace=github.com/apache/arrow/go/parquet=github.com/zeroshade/arrow/go/parquet@goparquet-file`,
 I believe. After that you should be able to just import it normally in a .go 
file by using `import "github.com/apache/arrow/go/parquet"` and so on. Let me 
know if you run into any issues.

> [Go][Parquet] Add File Package - readers
> 
>
> Key: ARROW-13984
> URL: https://issues.apache.org/jira/browse/ARROW-13984
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go, Parquet
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Add the package for manipulating files directly, column reader/writer, file 
> reader/writer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13984) [Go][Parquet] Add File Package - readers

2021-10-08 Thread Rickard Lundin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426312#comment-17426312
 ] 

Rickard Lundin commented on ARROW-13984:


This is bigger than seeing a panda being born! I wish I could figure out how 
to test it. Is it just a matter of cloning from git and building the whole 
arrow package? I will try to find the branch name.

/Rickard, a newborn Golanger

> [Go][Parquet] Add File Package - readers
> 
>
> Key: ARROW-13984
> URL: https://issues.apache.org/jira/browse/ARROW-13984
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go, Parquet
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Add the package for manipulating files directly, column reader/writer, file 
> reader/writer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14069) [R] By default, filter out hash functions in list_compute_functions()

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-14069.
-
Resolution: Fixed

Issue resolved by pull request 11363
[https://github.com/apache/arrow/pull/11363]

> [R] By default, filter out hash functions in list_compute_functions()
> -
>
> Key: ARROW-14069
> URL: https://issues.apache.org/jira/browse/ARROW-14069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: good-first-issue, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As users can't directly call the hash functions listed by 
> {{list_compute_functions()}}, we should filter those out so they're not 
> displayed.  Perhaps via a parameter, if we still need those for our internal 
> uses of {{list_compute_functions()}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13866.
-
Resolution: Fixed

> [R] Implement Options for all compute kernels available via 
> list_compute_functions
> --
>
> Key: ARROW-13866
> URL: https://issues.apache.org/jira/browse/ARROW-13866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
> Fix For: 6.0.0
>
>
> Not all of the compute kernels available via {{list_compute_functions()}} are 
> actually available to use in R, as they haven't been hooked up to the 
> relevant Options class in {{r/src/compute.cpp}}. 
> We should:
>  # Implement all remaining options classes
>  # Go through all the kernels listed by {{list_compute_functions()}} and 
> check that they have either no options classes to implement or that they have 
> been hooked up to the appropriate options class
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13901) [R] Implement IndexOptions

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13901.
-
Resolution: Fixed

Issue resolved by pull request 11357
[https://github.com/apache/arrow/pull/11357]

> [R] Implement IndexOptions
> --
>
> Key: ARROW-13901
> URL: https://issues.apache.org/jira/browse/ARROW-13901
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith

2021-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13924.
-
Resolution: Fixed

Issue resolved by pull request 11365
[https://github.com/apache/arrow/pull/11365]

> [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and 
> base::endsWith
> 
>
> Key: ARROW-13924
> URL: https://issues.apache.org/jira/browse/ARROW-13924
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: good-first-issue, kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13893) [R] Make head/tail lazy on datasets and queries

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13893:
---
Labels: pull-request-available query-engine  (was: query-engine)

> [R] Make head/tail lazy on datasets and queries
> ---
>
> Key: ARROW-13893
> URL: https://issues.apache.org/jira/browse/ARROW-13893
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API

2021-10-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13797:
--
Fix Version/s: 6.0.0

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> 
>
> Key: ARROW-13797
> URL: https://issues.apache.org/jira/browse/ARROW-13797
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, orc, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support 
> for ORC file format in the Datasets API, but the reader still reads all 
> columns regardless of the ScanOptions. Since ORC is a columnar format that 
> supports reading only specific fields, we can optimize this step. 
> The tricky part is to convert the field name of the Arrow schema to the index 
> in the ORC schema. Currently, this logic is included in the Python bindings 
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
>  so this needs to be moved to C++.
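
A simplified sketch of the top-level name-to-index mapping being discussed (the bindings linked above do more than this; hedged illustration):
{code:python}
import pyarrow as pa

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])
# Map the requested column names to their positions in the schema
indices = [schema.get_field_index(name) for name in ["b", "a"]]  # [1, 0]
{code}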



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API

2021-10-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-13797:
-

Assignee: Joris Van den Bossche

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> 
>
> Key: ARROW-13797
> URL: https://issues.apache.org/jira/browse/ARROW-13797
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, orc, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support 
> for ORC file format in the Datasets API, but the reader still reads all 
> columns regardless of the ScanOptions. Since ORC is a columnar format that 
> supports reading only specific fields, we can optimize this step. 
> The tricky part is to convert the field name of the Arrow schema to the index 
> in the ORC schema. Currently, this logic is included in the Python bindings 
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
>  so this needs to be moved to C++.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13797:
---
Labels: dataset orc pull-request-available  (was: dataset orc)

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> 
>
> Key: ARROW-13797
> URL: https://issues.apache.org/jira/browse/ARROW-13797
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, orc, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support 
> for ORC file format in the Datasets API, but the reader still reads all 
> columns regardless of the ScanOptions. Since ORC is a columnar format that 
> supports reading only specific fields, we can optimize this step. 
> The tricky part is to convert the field name of the Arrow schema to the index 
> in the ORC schema. Currently, this logic is included in the Python bindings 
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
>  so this needs to be moved to C++.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13887) [R] Capture error produced when reading in CSV file with headers and using a schema, and add suggestion

2021-10-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426224#comment-17426224
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-13887:
--

There is a suggestion from [~westonpace] to still address this issue, i.e. 
capture the error and give some useful information to the reader. 
We can then create a separate issue for the future on the schema vs col_types 
question. FWIW I'm happy with his suggestion: until we solve the underlying 
issues, a more informative message might be useful.
What do you think [~npr] [~jonkeane] [~thisisnic]?

> [R] Capture error produced when reading in CSV file with headers and using a 
> schema, and add suggestion
> ---
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: good-first-issue
> Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error 
> as the code tries to read in the header as a line of data.
> {code:java}
> share_data <- tibble::tibble(
>   company = c("AMZN", "GOOG", "BKNG", "TSLA"),
>   price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
>   company = utf8(),
>   price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid 
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data, 
> size, quoted, &value)
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496 
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument 
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from 
> the error message returned from C++.  We should capture the error and instead 
> supply our own error message using {{rlang::abort}} which informs the user of 
> the error and then suggests what they can do to prevent it.
>  
> For similar examples (and their associated PRs) see 
> {color:#1d1c1d}ARROW-11766, and ARROW-12791{color}
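
For comparison, the analogous situation in pyarrow's Python CSV reader, where the header row is skipped explicitly when types are supplied up front (a hedged sketch, not the proposed R fix itself):
{code:python}
import pyarrow as pa
from pyarrow import csv

read_opts = csv.ReadOptions(skip_rows=1, column_names=["company", "price"])
convert_opts = csv.ConvertOptions(
    column_types={"company": pa.string(), "price": pa.float64()})
table = csv.read_csv("share_data.csv",
                     read_options=read_opts,
                     convert_options=convert_opts)
{code}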



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14252) [R] Partial matching of arguments warning

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14252:
---
Labels: pull-request-available  (was: )

> [R] Partial matching of arguments warning
> -
>
> Key: ARROW-14252
> URL: https://issues.apache.org/jira/browse/ARROW-14252
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Nicola Crane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few examples of partially matched arguments in the code.  One 
> example is below, but there could be others.
> {code:r}
> Failure (test-dplyr-query.R:46:3): dim() on query
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` threw an unexpected warning.
> Message: partial match of 'filtered' to 'filtered_rows'
> Class:   simpleWarning/warning/condition
> Backtrace:
>   1. arrow:::expect_dplyr_equal(...) test-dplyr-query.R:46:2
>  11. arrow::dim.arrow_dplyr_query(.)
>  12. base::isTRUE(x$filtered) /Users/dragos/Documents/arrow/r/R/dplyr.R:147:2
> Failure (test-dplyr-query.R:46:3): dim() on query
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` threw an unexpected warning.
> Message: partial match of 'filtered' to 'filtered_rows'
> Class:   simpleWarning/warning/condition
> Backtrace:
>   1. arrow:::expect_dplyr_equal(...) test-dplyr-query.R:46:2
>  11. arrow::dim.arrow_dplyr_query(.)
>  12. base::isTRUE(x$filtered) /Users/dragos/Documents/arrow/r/R/dplyr.R:147:2
> {code}
> This is the relevant line of code in the example above: 
> https://github.com/apache/arrow/blob/25a6f591d1f162106b74e29870ebd4012e9874cc/r/R/dplyr.R#L150



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13988) [C++] Support binary-like types in hash_min_max, hash_min, hash_max

2021-10-08 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13988:


Assignee: David Li

> [C++] Support binary-like types in hash_min_max, hash_min, hash_max
> ---
>
> Key: ARROW-13988
> URL: https://issues.apache.org/jira/browse/ARROW-13988
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel
>
> An extension to ARROW-13882. Non-fixed-width types will need a separate 
> approach, so this was split out to a new JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13947) [C++] index kernel missing support for decimal, null, and fixed_size_binary

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13947:
---
Labels: kernel pull-request-available query-engine  (was: kernel 
query-engine)

> [C++] index kernel missing support for decimal, null, and fixed_size_binary
> ---
>
> Key: ARROW-13947
> URL: https://issues.apache.org/jira/browse/ARROW-13947
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available, query-engine
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The "index" kernel should support any equatable type.  At the moment it does 
> not support decimal, fixed_size_binary, or null.
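
For reference, the kernel on a type it already supports (sketch):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a", "b", "c"])
pc.index(arr, "b")  # -> 1; the same call should work for decimal etc.
{code}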



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13947) [C++] index kernel missing support for decimal, null, and fixed_size_binary

2021-10-08 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13947:


Assignee: David Li

> [C++] index kernel missing support for decimal, null, and fixed_size_binary
> ---
>
> Key: ARROW-13947
> URL: https://issues.apache.org/jira/browse/ARROW-13947
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: kernel, query-engine
>
> The "index" kernel should support any equatable type.  At the moment it does 
> not support decimal, fixed_size_binary, or null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14213) R arrow package not working on RStudio/Ubuntu

2021-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426206#comment-17426206
 ] 

Neal Richardson commented on ARROW-14213:
-

The build output says:

{code}
Binary package requires libcurl and openssl
If installation fails, retry after installing those system requirements
{code}

Can you install those and retry?

> R arrow package not working on RStudio/Ubuntu
> -
>
> Key: ARROW-14213
> URL: https://issues.apache.org/jira/browse/ARROW-14213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
> Copyright (C) 2020 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
>Reporter: Thomas Wutzler
>Priority: Major
>
> I am trying to read feather files, generated in Python, in R with the arrow 
> package.
> I run R 3.6.3 in an RStudio Server window on a Linux machine to which I have 
> no other access. I get the message:
>  {{Cannot call io___MemoryMappedFile__Open().}}
> According to the advice in the linked help-file: 
> [https://cran.r-project.org/web/packages/arrow/vignettes/install.html] I 
> create this issue with the full log of the installation:
> > arrow::install_arrow(verbose = TRUE)Installing package into 
> > '/Net/Groups/BGI/scratch/twutz/R/atacama-library/3.6'
> (as 'lib' is unspecified)trying URL 
> 'https://ftp5.gwdg.de/pub/misc/cran/src/contrib/arrow_5.0.0.2.tar.gz'Content 
> type 'application/octet-stream' length 483642 bytes (472 
> KB)==downloaded 472 KB* 
> installing *source* package 'arrow' ...** package 'arrow' successfully 
> unpacked and MD5 sums checked** using staged installationtrying URL 
> 'https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/ubuntu-16.04/arrow-5.0.0.2.zip'Content
>  type 'binary/octet-stream' length 17214781 bytes (16.4 
> MB)==downloaded 16.4 MB*** 
> Successfully retrieved C++ binaries for ubuntu-16.04
>  Binary package requires libcurl and openssl
>  If installation fails, retry after installing those system requirements
> PKG_CFLAGS=-I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include
>   -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3
> PKG_LIBS=-L/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/lib
>  -larrow_dataset -lparquet -larrow -larrow -larrow_bundled_dependencies 
> -larrow_dataset -lparquet -lssl -lcrypto -lcurl** libsg++ -std=gnu++11 
> -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c RTasks.cpp -o RTasks.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c altrep.cpp -o altrep.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c array.cpp -o array.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c array_to_vector.cpp -o array_to_vector.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/include/-fpic  -g -O2 
> -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
> -D_FORTIFY_SOURCE=2 -g  -c arraydata.cpp -o arraydata.o
> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG 
> -I/tmp/RtmpXvu6Oc/R.INSTALL1451f6ede9ea2/arrow/libarrow/arrow-5.0.0.2/include 
>  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET 
> -DARROW_R_WITH_S3 -I../inst/includ

[jira] [Updated] (ARROW-13111) [R] altrep vectors for ChunkedArray

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13111:
---
Labels: pull-request-available  (was: )

> [R] altrep vectors for ChunkedArray
> ---
>
> Key: ARROW-13111
> URL: https://issues.apache.org/jira/browse/ARROW-13111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13948) [C++] index_in/is_in kernels missing support for timestamp with timezone

2021-10-08 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-13948:


Assignee: David Li

> [C++] index_in/is_in kernels missing support for timestamp with timezone
> 
>
> Key: ARROW-13948
> URL: https://issues.apache.org/jira/browse/ARROW-13948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: David Li
>Priority: Major
>  Labels: good-first-issue, kernel, pull-request-available, 
> query-engine
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The index_in and is_in kernels should support all equatable value types.  At 
> the moment it supports all except for timestamp types that have a timezone.
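
For reference, the tz-naive case that already works (sketch); the same call with a timezone-aware type such as {{pa.timestamp("s", tz="UTC")}} is what currently fails:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3], type=pa.timestamp("s"))
pc.is_in(arr, value_set=pa.array([2], type=pa.timestamp("s")))
{code}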



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13948) [C++] index_in/is_in kernels missing support for timestamp with timezone

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13948:
---
Labels: good-first-issue kernel pull-request-available query-engine  (was: 
good-first-issue kernel query-engine)

> [C++] index_in/is_in kernels missing support for timestamp with timezone
> 
>
> Key: ARROW-13948
> URL: https://issues.apache.org/jira/browse/ARROW-13948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: good-first-issue, kernel, pull-request-available, 
> query-engine
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The index_in and is_in kernels should support all equatable value types.  At 
> the moment it supports all except for timestamp types that have a timezone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-14196:
--
Fix Version/s: 6.0.0

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 6.0.0
>
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-08 Thread Jim Pivarski (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426167#comment-17426167
 ] 

Jim Pivarski commented on ARROW-14196:
--

If it's changing a default, I can just set the option explicitly, right?

If it's changing how columns are named in the file (currently 
"fieldA.list.fieldB.list", etc.), then that would require adjustment on my 
side, but it would be an adjustment that depends on how the file was written, 
not the pyarrow version, right? If so, I'd have to support both naming 
conventions because both would exist in the wild.

If there is a change that Awkward Array has to adjust to, then now may be a 
good time to do it because we're going to be rewriting the "from_parquet" 
function soon, as part of integrating with Dask

https://github.com/ContinuumIO/dask-awkward

adopting fsspec, and using pyarrow's Dataset API (to replace our manual 
implementation). If there's something that we can set to get the new behavior, 
we can turn that on now while developing the new version and be ready for the 
change.
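
For anyone wanting to opt in ahead of the default flip, the Python option mentioned above can be set per write (a sketch; the inner list field is then named per the Parquet spec):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"fieldA": [[1, 2], [3]]})  # a list<int64> column
# Write with the Parquet-spec nested naming instead of the legacy one
pq.write_table(table, "compliant.parquet", use_compliant_nested_type=True)
{code}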

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14259) [R] converting from R vector to Array when the R vector is altrep

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14259:
---
Labels: pull-request-available  (was: )

> [R] converting from R vector to Array when the R vector is altrep
> -
>
> Key: ARROW-14259
> URL: https://issues.apache.org/jira/browse/ARROW-14259
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we have an R vector that was created from an Array with altrep, and then 
> we want to convert it back to an Array, currently it materializes the vector, 
> and it should not. Instead, it should grab the array from the internals of 
> the altrep object. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14257) [Doc][Python] dataset doc build fails

2021-10-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426081#comment-17426081
 ] 

Weston Pace commented on ARROW-14257:
-

In Python it is always use_async=True. In R, the scanner is hidden from the 
user on dataset writes, but the option there is use_async as well. In C++, the 
option is UseAsync in the ScannerBuilder. How about:

"Writing datasets requires that the input scanner is configured to scan 
asynchronously via the use_async or UseAsync options."
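
In Python terms, the requirement reads roughly like this (a sketch; paths and format are placeholders):
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("existing_root", format="parquet")
scanner = dataset.scanner(use_async=True)  # async scan is required for writing
ds.write_dataset(scanner, "new_root", format="parquet")
{code}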

> [Doc][Python] dataset doc build fails
> -
>
> Key: ARROW-14257
> URL: https://issues.apache.org/jira/browse/ARROW-14257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code}
> >>>-
> Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block 
> ending on line 578
> Specify :okexcept: as an option in the ipython:: block to suppress this 
> message
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 ds.write_dataset(scanner, new_root, format="parquet", 
> partitioning=new_part)
> ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
> basename_template, format, partitioning, partitioning_flavor, schema, 
> filesystem, file_options, use_threads, max_partitions, file_visitor)
> 861 _filesystemdataset_write(
> 862 scanner, base_dir, basename_template, filesystem, 
> partitioning,
> --> 863 file_options, max_partitions, file_visitor
> 864 )
> ~/arrow/dev/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Asynchronous scanning is not supported by 
> SyncScanner
> /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367  
> scanner->ScanBatchesAsync()
> <<<-
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13924:
---
Labels: good-first-issue kernel pull-request-available  (was: 
good-first-issue kernel)

> [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and 
> base::endsWith
> 
>
> Key: ARROW-13924
> URL: https://issues.apache.org/jira/browse/ARROW-13924
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: good-first-issue, kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14257) [Doc][Python] dataset doc build fails

2021-10-08 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426037#comment-17426037
 ] 

Joris Van den Bossche commented on ARROW-14257:
---

The error message is indeed not very helpful. We could mention something about 
using the {{use_async}} scan option, although I am not fully sure it would be 
applicable in all cases where you can run into this error?

> [Doc][Python] dataset doc build fails
> -
>
> Key: ARROW-14257
> URL: https://issues.apache.org/jira/browse/ARROW-14257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> >>>-
> Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block 
> ending on line 578
> Specify :okexcept: as an option in the ipython:: block to suppress this 
> message
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 ds.write_dataset(scanner, new_root, format="parquet", 
> partitioning=new_part)
> ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
> basename_template, format, partitioning, partitioning_flavor, schema, 
> filesystem, file_options, use_threads, max_partitions, file_visitor)
> 861 _filesystemdataset_write(
> 862 scanner, base_dir, basename_template, filesystem, 
> partitioning,
> --> 863 file_options, max_partitions, file_visitor
> 864 )
> ~/arrow/dev/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Asynchronous scanning is not supported by 
> SyncScanner
> /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367  
> scanner->ScanBatchesAsync()
> <<<-
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14257) [Doc][Python] dataset doc build fails

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14257:
---
Labels: pull-request-available  (was: )

> [Doc][Python] dataset doc build fails
> -
>
> Key: ARROW-14257
> URL: https://issues.apache.org/jira/browse/ARROW-14257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> >>>-
> Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block 
> ending on line 578
> Specify :okexcept: as an option in the ipython:: block to suppress this 
> message
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 ds.write_dataset(scanner, new_root, format="parquet", 
> partitioning=new_part)
> ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
> basename_template, format, partitioning, partitioning_flavor, schema, 
> filesystem, file_options, use_threads, max_partitions, file_visitor)
> 861 _filesystemdataset_write(
> 862 scanner, base_dir, basename_template, filesystem, 
> partitioning,
> --> 863 file_options, max_partitions, file_visitor
> 864 )
> ~/arrow/dev/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Asynchronous scanning is not supported by 
> SyncScanner
> /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367  
> scanner->ScanBatchesAsync()
> <<<-
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-14257) [Doc][Python] dataset doc build fails

2021-10-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-14257:
-

Assignee: Joris Van den Bossche

> [Doc][Python] dataset doc build fails
> -
>
> Key: ARROW-14257
> URL: https://issues.apache.org/jira/browse/ARROW-14257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Python
>Reporter: Antoine Pitrou
>Assignee: Joris Van den Bossche
>Priority: Blocker
> Fix For: 6.0.0
>
>
> {code}
> >>>-
> Exception in /home/antoine/arrow/dev/docs/source/python/dataset.rst at block 
> ending on line 578
> Specify :okexcept: as an option in the ipython:: block to suppress this 
> message
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 ds.write_dataset(scanner, new_root, format="parquet", 
> partitioning=new_part)
> ~/arrow/dev/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
> basename_template, format, partitioning, partitioning_flavor, schema, 
> filesystem, file_options, use_threads, max_partitions, file_visitor)
> 861 _filesystemdataset_write(
> 862 scanner, base_dir, basename_template, filesystem, 
> partitioning,
> --> 863 file_options, max_partitions, file_visitor
> 864 )
> ~/arrow/dev/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/arrow/dev/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Asynchronous scanning is not supported by 
> SyncScanner
> /home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:367  
> scanner->ScanBatchesAsync()
> <<<-
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13924) [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and base::endsWith

2021-10-08 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane reassigned ARROW-13924:


Assignee: Nicola Crane

> [R] Bindings for stringr::str_starts, stringr::str_ends, base::startsWith and 
> base::endsWith
> 
>
> Key: ARROW-13924
> URL: https://issues.apache.org/jira/browse/ARROW-13924
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: good-first-issue, kernel
> Fix For: 6.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14069) [R] By default, filter out hash functions in list_compute_functions()

2021-10-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14069:
---
Labels: good-first-issue pull-request-available  (was: good-first-issue)

> [R] By default, filter out hash functions in list_compute_functions()
> -
>
> Key: ARROW-14069
> URL: https://issues.apache.org/jira/browse/ARROW-14069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Critical
>  Labels: good-first-issue, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As users can't directly call the hash functions listed by 
> {{list_compute_functions()}}, we should filter those out so they're not 
> displayed.  Perhaps via a parameter, if we still need those for our internal 
> uses of {{list_compute_functions()}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)