[jira] [Comment Edited] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2023-01-05 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655102#comment-17655102
 ] 

Joris Van den Bossche edited comment on ARROW-18400 at 1/5/23 7:08 PM:
---

Yes, I think it has to use {{Flatten()}} instead of {{values()}}; that is the 
proper method to get 'flat' values taking the offset etc. into account (i.e. it 
does not just return the underlying memory).

 

The {{.values()}} accessor is really an easy way to shoot yourself in the foot ... I 
remember similar issues in pyarrow, as we expose those two methods on 
pyarrow.ListArray as well.
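
(For illustration, a minimal pyarrow sketch of the difference; in Python these are 
exposed as the {{.values}} property and the {{.flatten()}} method:)

{code:python}
import pyarrow as pa

arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
sliced = arr.slice(2, 2)  # logically [[6], [7, 8]]

# .values ignores the slice offset and returns the full underlying child array
print(sliced.values)     # [1, 2, 3, 4, 5, 6, 7, 8]
# .flatten() takes the offset into account and returns only the slice's values
print(sliced.flatten())  # [6, 7, 8]
{code}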


was (Author: jorisvandenbossche):
Yes, I think it has to use {{Flatten()}} instead of {{values()}}, that is the 
proper method to get 'flat' values taking into account offset etc (i.e. that 
not just returns the underlying memory).

 

The {{.values()}} is really an easy way to shoot yourself in the foot .. I 
remember similar issues related to pyarrow as we expose those two methods as 
well.

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2023-01-05 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655102#comment-17655102
 ] 

Joris Van den Bossche commented on ARROW-18400:
---

Yes, I think it has to use {{Flatten()}} instead of {{values()}}; that is the 
proper method to get 'flat' values taking the offset etc. into account (i.e. it 
does not just return the underlying memory).

 

The {{.values()}} accessor is really an easy way to shoot yourself in the foot ... I 
remember similar issues in pyarrow, as we expose those two methods as 
well.

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16728) [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset

2022-12-23 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-16728.
---
Resolution: Fixed

Issue resolved by pull request 14052
https://github.com/apache/arrow/pull/14052

> [Python] Switch default and deprecate use_legacy_dataset=True in 
> ParquetDataset
> ---
>
> Key: ARROW-16728
> URL: https://issues.apache.org/jira/browse/ARROW-16728
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The ParquetDataset() constructor itself still defaults to 
> {{use_legacy_dataset=True}} (although using specific attributes or keywords 
> related to that will raise a warning). So the next step is to actually 
> deprecate passing that and switch the default, and only afterwards can we 
> remove the code.
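
(A minimal sketch, for illustration, of opting in to the non-legacy implementation 
while the keyword is still accepted; the path below is a hypothetical placeholder:)

{code:python}
import pyarrow.parquet as pq

# explicitly opt out of the legacy implementation ahead of the default switch
dataset = pq.ParquetDataset("path/to/dataset/", use_legacy_dataset=False)
table = dataset.read()
{code}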



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-12-23 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18363.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14839
https://github.com/apache/arrow/pull/14839

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Now that we have versioned docs, we also have old versions of the developers 
> docs (e.g. 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (e.g. regarding communication channels, build instructions, 
> etc.), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs, 
> similar to how some projects warn about viewing old docs in general and point 
> to the stable docs (e.g. https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
> custom box on pages under /developers that points to the dev docs instead of 
> the stable docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16337) [Python] Expose parameter that determines to store Arrow schema in Parquet metadata in Python

2022-12-23 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-16337.
---
Resolution: Fixed

Issue resolved by pull request 13000
https://github.com/apache/arrow/pull/13000

> [Python] Expose parameter that determines to store Arrow schema in Parquet 
> metadata in Python
> -
>
> Key: ARROW-16337
> URL: https://issues.apache.org/jira/browse/ARROW-16337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There is a {{store_schema}} flag that determines whether we store the Arrow 
> schema in the Parquet metadata (under the {{ARROW:schema}} key) or not. This 
> is exposed in C++, but not in the Python interface. It would be good to also 
> expose this in the Python layer, to more easily experiment with it (e.g. to 
> check the impact of having the schema available or not when reading a file).
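
(A minimal sketch of how this could look once exposed, assuming the keyword is 
surfaced as {{store_schema}} on {{pyarrow.parquet.write_table}}:)

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})

# assumption: write_table exposes the C++ store_schema option as a keyword
pq.write_table(table, "with_schema.parquet", store_schema=True)      # current default
pq.write_table(table, "without_schema.parquet", store_schema=False)

# check whether the ARROW:schema key ended up in the file metadata
meta = pq.read_metadata("without_schema.parquet").metadata
print(b"ARROW:schema" in (meta or {}))
{code}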



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18394) [CI][Python] Nightly Python pandas jobs using latest or upstream_devel fail

2022-12-22 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18394.
---
Resolution: Fixed

Issue resolved by pull request 15048
https://github.com/apache/arrow/pull/15048

> [CI][Python] Nightly Python pandas jobs using latest or upstream_devel fail
> --
>
> Key: ARROW-18394
> URL: https://issues.apache.org/jira/browse/ARROW-18394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: Nightly, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Currently the following jobs fail:
> |test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3532562061/jobs/5927065343|
> |test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3532562477/jobs/5927066168|
> with:
> {code:java}
>   _ test_roundtrip_with_bytes_unicode[columns0] 
> __columns = [b'foo']    @pytest.mark.parametrize('columns', 
> ([b'foo'], ['foo']))
>     def test_roundtrip_with_bytes_unicode(columns):
>         df = pd.DataFrame(columns=columns)
>         table1 = pa.Table.from_pandas(df)
> >       table2 = 
> > pa.Table.from_pandas(table1.to_pandas())opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_pandas.py:2867:
> >  
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/array.pxi:830: in pyarrow.lib._PandasConvertible.to_pandas
>     ???
> pyarrow/table.pxi:3908: in pyarrow.lib.Table._to_pandas
>     ???
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819:
>  in table_to_blockmanager
>     columns = _deserialize_column_index(table, all_columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:935:
>  in _deserialize_column_index
>     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1154:
>  in _reconstruct_columns_from_metadata
>     level = level.astype(dtype)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:1029:
>  in astype
>     return Index(new_values, name=self.name, dtype=new_values.dtype, 
> copy=False)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:518:
>  in __new__
>     klass = cls._dtype_to_subclass(arr.dtype)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ cls = , dtype = dtype('S3')    
> @final
>     @classmethod
>     def _dtype_to_subclass(cls, dtype: DtypeObj):
>         # Delay import for perf. 
> https://github.com/pandas-dev/pandas/pull/31423
>     
>         if isinstance(dtype, ExtensionDtype):
>             if isinstance(dtype, DatetimeTZDtype):
>                 from pandas import DatetimeIndex
>     
>                 return DatetimeIndex
>             elif isinstance(dtype, CategoricalDtype):
>                 from pandas import CategoricalIndex
>     
>                 return CategoricalIndex
>             elif isinstance(dtype, IntervalDtype):
>                 from pandas import IntervalIndex
>     
>                 return IntervalIndex
>             elif isinstance(dtype, PeriodDtype):
>                 from pandas import PeriodIndex
>     
>                 return PeriodIndex
>     
>             return Index
>     
>         if dtype.kind == "M":
>             from pandas import DatetimeIndex
>     
>             return DatetimeIndex
>     
>         elif dtype.kind == "m":
>             from pandas import TimedeltaIndex
>     
>             return TimedeltaIndex
>     
>         elif dtype.kind == "f":
>             from pandas.core.api import Float64Index
>     
>             return Float64Index
>         elif dtype.kind == "u":
>             from pandas.core.api import UInt64Index
>     
>             return UInt64Index
>         elif dtype.kind == "i":
>             from pandas.core.api import Int64Index
>     
>             return Int64Index
>     
>         elif dtype.kind == "O":
>             # NB: assuming away MultiIndex
>             return Index
>     
>         elif issubclass(
>             dtype.type, (str, bool, np.bool_, complex, np.complex64, 
> np.complex128)
>         ):
>             return Index
>     
> >       raise NotImplementedError(dtype)
> E       NotImplementedError: 
> |S3opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:595:
>  NotImplementedError{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18272) [pyarrow] ParquetFile does not recognize GCS cloud path as a string

2022-12-22 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18272.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14717
https://github.com/apache/arrow/pull/14717

> [pyarrow] ParquetFile does not recognize GCS cloud path as a string
> ---
>
> Key: ARROW-18272
> URL: https://issues.apache.org/jira/browse/ARROW-18272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
>Reporter: Zepu Zhang
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> I have a Parquet file at
>  
> path = 'gs://mybucket/abc/d.parquet'
>  
> `pyarrow.parquet.read_metadata(path)` works fine.
>  
> `pyarrow.parquet.ParquetFile(path)` raises "Failed to open local file 
> 'gs://mybucket/abc/d.parquet'".
>  
> Looks like ParquetFile misses the path resolution logic found in 
> `read_metadata`.
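
(As a workaround until the path resolution is in place, the filesystem can be 
resolved explicitly and an open file handle passed to ParquetFile; a sketch using 
the bucket path from above:)

{code:python}
import pyarrow.fs as fs
import pyarrow.parquet as pq

# resolve the gs:// URI to a filesystem and a filesystem-relative path ourselves
gcs, path = fs.FileSystem.from_uri("gs://mybucket/abc/d.parquet")
with gcs.open_input_file(path) as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)
{code}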



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-12-22 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651169#comment-17651169
 ] 

Joris Van den Bossche commented on ARROW-18400:
---

A small reproducible example to illustrate my explanation above:

{code:python}
import pyarrow as pa

# creating a chunked list array that consists of two chunks that are both
# slices into the same parent array
arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
chunked_arr = pa.chunked_array([arr.slice(0, 2), arr.slice(2, 2)])

# converting this chunked array to numpy
np_arr = chunked_arr.to_numpy()

# the list array gets converted to a numpy array of numpy arrays. Each element
# (the nested numpy array) is a slice of a numpy array of the flat values. We
# can get this parent flat numpy array through the .base property
>>> np_arr[0].base
array([[1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8]])

# the flat values are included twice. Comparing to the correct behaviour with
# the original non-chunked array:
>>> arr.to_numpy(zero_copy_only=False)[0].base
array([[1, 2, 3, 4, 5, 6, 7, 8]])
{code}

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-12-21 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650932#comment-17650932
 ] 

Joris Van den Bossche commented on ARROW-18400:
---

Using Alenka's script, I explored it a bit further and noticed some things: 
when reading the parquet file using the dataset reader, it uses more memory to 
convert to pandas, but based on the `memray` profiler it didn't seem to take 
any other code path; all the code paths just allocate twice the memory (twice 
in my case, but for a larger dataset it might be x4 or x8 etc.). So there needed 
to be _something_ different about the table created from reading the parquet 
file with the legacy API vs the dataset API. And it seems that with the dataset 
API, it returns multiple chunks, but each of those chunks is actually a slice 
of a single buffer. And in the conversion layer, there is something not taking 
this offset into the buffer into account.

Illustrating this, reading the parquet file in two ways (I have been using 
{{nrows = 1_024_000 // 4}}, so the file is a bit smaller, with fewer chunks):

{code:python}
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table1 = pq.read_table("memory_testing.parquet", use_legacy_dataset=True)
dataset = ds.dataset("memory_testing.parquet", format="parquet")
table2 = dataset.to_table()
{code}

Table 1 has a single chunk, while table 2 (from reading with dataset API) has 
two chunks:

{code}
>>> table1["c1"].num_chunks
1
>>> table2["c1"].num_chunks
2
{code}

Taking the first chunk of each of those, and then looking at those arrays:

{code:python}
arr1 = table1["c1"].chunk(0)
arr2 = table2["c1"].chunk(0)

>>> len(arr1)
256000
>>> len(arr2)  # around half the number of rows (since there are two chunks in this table)
131072
>>> arr1.get_total_buffer_size()
110624012
>>> arr2.get_total_buffer_size()  # but still using the same total memory!
110624012
{code}

So the smaller chunk of table2 is not using less memory. That is because the 
two chunks of table2 are actually each a slice into the same underlying buffers:

{code:python}
>>> table2["c1"].chunk(0).buffers()[1]

>>> table2["c1"].chunk(1).buffers()[1]  # second chunk points to same memory address and has same size as first chunk

>>> table2["c1"].chunk(1).offset  # and the second chunk has an offset to account for that
131072
{code}
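
(The same comparison can be made explicit via the buffer addresses; a small sketch:)

{code:python}
# both chunks point at the very same offsets buffer, just with a different offset
buf0 = table2["c1"].chunk(0).buffers()[1]
buf1 = table2["c1"].chunk(1).buffers()[1]
print(buf0.address == buf1.address, buf0.size == buf1.size)
{code}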

And somehow the conversion code for ListArray to numpy (which creates a numpy 
array of numpy arrays, by first creating one numpy array of the flat values, 
and then creating slices into that flat array) doesn't seem to take into 
account this offset, and ends up converting the full parent buffer twice (in my 
case twice, because of having 2 chunks, but this can grow quadratically).

---

The reason this happens for parquet and not for feather in this case is that 
the Parquet file actually consists of a single row group (and I assume the 
dataset API will therefore still read it in one go, and then slice output 
batches from that to return the expected batch size), while the feather file 
already consists of multiple batches on disk (and thus doesn't result in sliced 
batches in memory).


> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
> Fix For: 11.0.0
>
> Attachments: test_memory.py
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } fo

[jira] [Updated] (ARROW-8891) [C++] Split non-cast compute kernels into a separate shared library

2022-12-16 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8891:
-
Priority: Critical  (was: Major)

> [C++] Split non-cast compute kernels into a separate shared library
> ---
>
> Key: ARROW-8891
> URL: https://issues.apache.org/jira/browse/ARROW-8891
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Critical
>
> Since we are going to implement a lot more precompiled kernels, I am not sure 
> it makes sense to require all of them to be compiled unconditionally just to 
> get access to {{compute::Cast}}, which is needed in many different contexts.
> After ARROW-8792 is merged, I would suggest creating a plugin hook for adding 
> a bundle of kernels from a shared library outside of libarrow.so, and then 
> moving all the object code outside of Cast to something like 
> libarrow_compute.so. Then we can change the CMake flags to compile Cast 
> kernels always (?) and then opt in to building the additional kernels package 
> separately



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18394) [CI][Python] Nightly Python pandas jobs using latest or upstream_devel fail

2022-12-08 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644867#comment-17644867
 ] 

Joris Van den Bossche commented on ARROW-18394:
---

For the failure shown above, this seems to be a regression on the pandas side, 
and I opened an issue there to discuss it further: 
https://github.com/pandas-dev/pandas/issues/50127

> [CI][Python] Nightly Python pandas jobs using latest or upstream_devel fail
> --
>
> Key: ARROW-18394
> URL: https://issues.apache.org/jira/browse/ARROW-18394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: Nightly
> Fix For: 11.0.0
>
>
> Currently the following jobs fail:
> |test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3532562061/jobs/5927065343|
> |test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3532562477/jobs/5927066168|
> with:
> {code:java}
>   _ test_roundtrip_with_bytes_unicode[columns0] 
> __columns = [b'foo']    @pytest.mark.parametrize('columns', 
> ([b'foo'], ['foo']))
>     def test_roundtrip_with_bytes_unicode(columns):
>         df = pd.DataFrame(columns=columns)
>         table1 = pa.Table.from_pandas(df)
> >       table2 = 
> > pa.Table.from_pandas(table1.to_pandas())opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_pandas.py:2867:
> >  
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/array.pxi:830: in pyarrow.lib._PandasConvertible.to_pandas
>     ???
> pyarrow/table.pxi:3908: in pyarrow.lib.Table._to_pandas
>     ???
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819:
>  in table_to_blockmanager
>     columns = _deserialize_column_index(table, all_columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:935:
>  in _deserialize_column_index
>     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1154:
>  in _reconstruct_columns_from_metadata
>     level = level.astype(dtype)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:1029:
>  in astype
>     return Index(new_values, name=self.name, dtype=new_values.dtype, 
> copy=False)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:518:
>  in __new__
>     klass = cls._dtype_to_subclass(arr.dtype)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ cls = , dtype = dtype('S3')    
> @final
>     @classmethod
>     def _dtype_to_subclass(cls, dtype: DtypeObj):
>         # Delay import for perf. 
> https://github.com/pandas-dev/pandas/pull/31423
>     
>         if isinstance(dtype, ExtensionDtype):
>             if isinstance(dtype, DatetimeTZDtype):
>                 from pandas import DatetimeIndex
>     
>                 return DatetimeIndex
>             elif isinstance(dtype, CategoricalDtype):
>                 from pandas import CategoricalIndex
>     
>                 return CategoricalIndex
>             elif isinstance(dtype, IntervalDtype):
>                 from pandas import IntervalIndex
>     
>                 return IntervalIndex
>             elif isinstance(dtype, PeriodDtype):
>                 from pandas import PeriodIndex
>     
>                 return PeriodIndex
>     
>             return Index
>     
>         if dtype.kind == "M":
>             from pandas import DatetimeIndex
>     
>             return DatetimeIndex
>     
>         elif dtype.kind == "m":
>             from pandas import TimedeltaIndex
>     
>             return TimedeltaIndex
>     
>         elif dtype.kind == "f":
>             from pandas.core.api import Float64Index
>     
>             return Float64Index
>         elif dtype.kind == "u":
>             from pandas.core.api import UInt64Index
>     
>             return UInt64Index
>         elif dtype.kind == "i":
>             from pandas.core.api import Int64Index
>     
>             return Int64Index
>     
>         elif dtype.kind == "O":
>             # NB: assuming away MultiIndex
>             return Index
>     
>         elif issubclass(
>             dtype.type, (str, bool, np.bool_, complex, np.complex64, 
> np.complex128)
>         ):
>             return Index
>     
> >       raise NotImplementedError(dtype)
> E       NotImplementedError: 
> |S3opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:595:
>  NotImplementedError{code}



--

[jira] [Created] (ARROW-18428) [Website] Enable github issues on arrow-site repo

2022-12-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18428:
-

 Summary: [Website] Enable github issues on arrow-site repo
 Key: ARROW-18428
 URL: https://issues.apache.org/jira/browse/ARROW-18428
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Joris Van den Bossche


Now that we are moving to GitHub issues, it probably makes sense to open issues 
about the website in its own arrow-site repo, instead of keeping them in the 
main arrow repo.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14799) [C++] Adding tabular pretty printing of Table / RecordBatch

2022-12-07 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644352#comment-17644352
 ] 

Joris Van den Bossche commented on ARROW-14799:
---

If we tackle this in C++, it might be worth checking out duckdb's 
implementation. If we decide to tackle this in the bindings, for Python it 
might be worth checking out ibis' implementation (using rich; they recently 
revamped their table representation, including support for nested columns).

> [C++] Adding tabular pretty printing of Table / RecordBatch
> ---
>
> Key: ARROW-14799
> URL: https://issues.apache.org/jira/browse/ARROW-14799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> It would be nice to show a "preview" (eg xx number of first and last rows) of 
> a Table or RecordBatch in a traditional tabular form (like pandas DataFrames, 
> or R data.frame / tibbles have, or in a format that resembles markdown 
> tables). 
> This could also be added in the bindings, but we could also do it on the C++ 
> level to benefit multiple bindings at once.
> Based on a quick search, there is https://github.com/p-ranav/tabulate which 
> could be vendored (it has a single-include version).
> I suppose that nested data types could represent a challenge on how to 
> include those in a tabular format, though.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-12-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18123.
---
Resolution: Fixed

Issue resolved by pull request 14764
https://github.com/apache/arrow/pull/14764

> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: Miles Granger
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}
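
(For reference, a sketch of a workaround that side-steps the URI parsing by 
passing the filesystem explicitly to write_table:)

{code:python}
import pyarrow as pa
import pyarrow.fs as fs
import pyarrow.parquet as pq

table = pa.table({"one": [-1.0, None, 2.5]})
# with an explicit filesystem, the path is not interpreted as a URI
pq.write_table(table, "例.parquet", filesystem=fs.LocalFileSystem())
{code}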



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18003) [Python] Add sort_by to RecordBatch

2022-12-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18003:
--
Labels: good-first-issue  (was: )

> [Python] Add sort_by to RecordBatch
> ---
>
> Key: ARROW-18003
> URL: https://issues.apache.org/jira/browse/ARROW-18003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
>  Labels: good-first-issue
> Fix For: 11.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18003) [Python] Add sort_by to RecordBatch

2022-12-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18003:
--
Summary: [Python] Add sort_by to RecordBatch  (was: [Python] Add sort_by to 
Table and RecordBatch)

> [Python] Add sort_by to RecordBatch
> ---
>
> Key: ARROW-18003
> URL: https://issues.apache.org/jira/browse/ARROW-18003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18280) [C++][Python] Support slicing to arbitrary end in list_slice kernel

2022-12-06 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18280.
---
Resolution: Fixed

Issue resolved by pull request 14749
https://github.com/apache/arrow/pull/14749

> [C++][Python] Support slicing to arbitrary end in list_slice kernel
> ---
>
> Key: ARROW-18280
> URL: https://issues.apache.org/jira/browse/ARROW-18280
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [GitHub PR|https://github.com/apache/arrow/pull/14395] adds support for the 
> {{list_slice}} kernel, but does not implement what to do when {{stop == 
> std::nullopt}}, which should slice to the end of the list elements.
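
(For reference, a minimal sketch of the intended behaviour, assuming the kernel is 
exposed in Python as {{pyarrow.compute.list_slice}}:)

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1, 2, 3], [4, 5], [6]])
# with stop left unset, each list should be sliced to its own end
print(pc.list_slice(arr, start=1))          # expected: [[2, 3], [5], []]
print(pc.list_slice(arr, start=0, stop=2))  # expected: [[1, 2], [4, 5], [6]]
{code}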



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18400) [Python] Quadratic memory usage of Table.to_pandas with nested data

2022-12-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642421#comment-17642421
 ] 

Joris Van den Bossche commented on ARROW-18400:
---

While combining the chunks before converting to pandas is a useful workaround, 
that still seems to point to a bug or inefficiency in the 
pyarrow.Table->pandas.DataFrame conversion for nested columns.

The main conversion logic for this lives in {{arrow_to_pandas.cc}} 
({{ConvertStruct}} and {{ConvertListLike}}). For structs, it seems to iterate 
over the chunks of the ChunkedArray, and in that loop, iterate over the fields 
to convert each field array to a numpy array and then create python 
dictionaries combining those fields. And those dictionaries are inserted into 
the numpy object dtype array for the full column (allocated in advance, for all 
chunks). 
So from that quick look, I don't directly see why combining the chunks in 
advance would help / why having multiple chunks results in a higher memory 
usage.  
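
(For reference, the workaround mentioned above is roughly the following sketch; 
it trades the conversion overhead for one up-front concatenation of the chunks:)

{code:python}
import pyarrow.parquet as pq

table = pq.read_table("nested.parquet")
# concatenate into a single non-sliced chunk before the nested conversion
df = table.combine_chunks().to_pandas()
{code}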

> [Python] Quadratic memory usage of Table.to_pandas with nested data
> ---
>
> Key: ARROW-18400
> URL: https://issues.apache.org/jira/browse/ARROW-18400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.1
> Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X 
> with 64 GB RAM
>Reporter: Adam Reeve
>Assignee: Alenka Frim
>Priority: Critical
> Fix For: 11.0.0
>
>
> Reading nested Parquet data and then converting it to a Pandas DataFrame 
> shows quadratic memory usage and will eventually run out of memory for 
> reasonably small files. I had initially thought this was a regression since 
> 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks 
> in at higher row counts.
> Example code to generate nested Parquet data:
> {code:python}
> import numpy as np
> import random
> import string
> import pandas as pd
> _characters = string.ascii_uppercase + string.digits + string.punctuation
> def make_random_string(N=10):
>     return ''.join(random.choice(_characters) for _ in range(N))
> nrows = 1_024_000
> filename = 'nested.parquet'
> arr_len = 10
> nested_col = []
> for i in range(nrows):
>     nested_col.append(np.array(
>             [{
>                 'a': None if i % 1000 == 0 else np.random.choice(1, 
> size=3).astype(np.int64),
>                 'b': None if i % 100 == 0 else random.choice(range(100)),
>                 'c': None if i % 10 == 0 else make_random_string(5)
>             } for i in range(arr_len)]
>         ))
> df = pd.DataFrame({'c1': nested_col})
> df.to_parquet(filename)
> {code}
> And then read into a DataFrame with:
> {code:python}
> import pyarrow.parquet as pq
> table = pq.read_table(filename)
> df = table.to_pandas()
> {code}
> Only reading to an Arrow table isn't a problem, it's the to_pandas method 
> that exhibits the large memory usage. I haven't tested generating nested 
> Arrow data in memory without writing Parquet from Pandas but I assume the 
> problem probably isn't Parquet specific.
> Memory usage I see when reading different sized files on a machine with 64 GB 
> RAM:
> ||Num rows||Memory used with 10.0.1 (MB)||Memory used with 7.0.0 (MB)||
> |32,000|362|361|
> |64,000|531|531|
> |128,000|1,152|1,101|
> |256,000|2,888|1,402|
> |512,000|10,301|3,508|
> |1,024,000|38,697|5,313|
> |2,048,000|OOM|20,061|
> |4,096,000| |OOM|
> With Arrow 10.0.1, memory usage approximately quadruples when row count 
> doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but 
> then quadruples from 1024k to 2048k rows.
> PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something 
> changed between 7.0.0 and 8.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-12-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641982#comment-17641982
 ] 

Joris Van den Bossche commented on ARROW-18265:
---

bq. However, it would be a bit tricky if FindOne/FindAll ended up calling 
list_element (a compute function). FindOne/FindAll currently is in the core 
module and not the compute module. 

That's a good point. So, for doing FindOne/FindAll itself, we won't need the 
"list_element" kernel, since this returns a path, and for that we don't need to 
do the actual selection from the array. 
There is however a set of {{FieldPath::Get(..)}} methods that you can call on 
the resulting FieldPath. For "getting" the path from a schema or type, that's 
still OK (no need for the actual kernel), but there is a signature for 
{{Get()}} to apply the path on a record batch or array. For that specific case, 
one would actually need to use the kernel. Now, I am not sure we actually use 
this variant internally, though (outside of a few tests). 

> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but it does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-12-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641750#comment-17641750
 ] 

Joris Van den Bossche commented on ARROW-18375:
---

bq. I use "Type: enhancement" for user-visible enhancements (such as new 
features, performance improvements...) and "Type: task" for things that don't 
affect them directly (such as an internal refactor).

Yes, that's what I meant as well, but with a much clearer phrasing ;) 

> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641378#comment-17641378
 ] 

Joris Van den Bossche commented on ARROW-18375:
---

I interpret "enhancement" as an enhancement in the functionality _of our 
libraries_ (you could indeed use "feature" for that, but not every enhancement 
is necessarily a new feature, if you interpret that more strictly as a 
completely new functionality, and not enhancement to existing features). 
And then the things you list as refactoring, adding tests, adding CI, update 
packaging/release tooling etc would be things that fall outside of that 
"enhancement/feature" category (although they of course _enhance_ our codebase, 
at least that's typically the goal ;)).

> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641367#comment-17641367
 ] 

Joris Van den Bossche commented on ARROW-18375:
---

(I added "Type: test" and "Type: task" as labels on github, so at least the 
labels required for the migration are present now)

> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18375) MIGRATION: Enable GitHub issue type labels

2022-11-30 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641351#comment-17641351
 ] 

Joris Van den Bossche commented on ARROW-18375:
---

We should probably also add "Type: test" and "Type: task" as GitHub labels, 
even if we don't use them directly on GitHub, since the current migration 
scripts will use those labels.

For GitHub itself, I also think we should add another form so we can report 
issues that are not "bug" or "enhancement". Do we want to preserve the 
distinction between "test" and "task", or can we consolidate those (as any 
to-do item that is not a bug or enhancement)?


> MIGRATION: Enable GitHub issue type labels
> --
>
> Key: ARROW-18375
> URL: https://issues.apache.org/jira/browse/ARROW-18375
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> As part of enabling GitHub issue reporting, the following labels have been 
> defined and need to be added to the repository label options. Without these 
> labels added, [new issues|https://github.com/apache/arrow/issues/14692] do 
> not get the issue template-defined issue type labels set properly.
>  
> Labels:
>  * Type: bug
>  * Type: enhancement
>  * Type: usage
>  * Type: task
>  * Type: test
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18376) MIGRATION: Add component labels to GitHub

2022-11-30 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641152#comment-17641152
 ] 

Joris Van den Bossche commented on ARROW-18376:
---

Those are now all present as labels, so this can be closed?

> MIGRATION: Add component labels to GitHub
> -
>
> Key: ARROW-18376
> URL: https://issues.apache.org/jira/browse/ARROW-18376
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> Similar to ARROW-18375, component labels have been established based on 
> existing component values defined in ASF Jira. The following labels are 
> needed:
> * Component: Archery
> * Component: Benchmarking
> * Component: C
> * Component: C#
> * Component: C++
> * Component: C++ - Gandiva
> * Component: C++ - Plasma
> * Component: Continuous Integration
> * Component: Dart
> * Component: Developer Tools
> * Component: Documentation
> * Component: FlightRPC
> * Component: Format
> * Component: GLib
> * Component: Go
> * Component: GPU
> * Component: Integration
> * Component: Java
> * Component: JavaScript
> * Component: MATLAB
> * Component: Packaging
> * Component: Parquet
> * Component: Python
> * Component: R
> * Component: Ruby
> * Component: Swift
> * Component: Website
> * Component: Other



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18125) [Python] Handle pytest 8 deprecations about pytest.warns(None)

2022-11-30 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18125.
---
Resolution: Fixed

Issue resolved by pull request 14729
https://github.com/apache/arrow/pull/14729

> [Python] Handle pytest 8 deprecations about pytest.warns(None) 
> ---
>
> Key: ARROW-18125
> URL: https://issues.apache.org/jira/browse/ARROW-18125
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Miles Granger
>Priority: Major
> Fix For: 11.0.0
>
>
> We have a few warnings about that when running the tests, for example:
> {code}
> pyarrow/tests/test_pandas.py::TestConvertMetadata::test_rangeindex_doesnt_warn
> pyarrow/tests/test_pandas.py::TestConvertMetadata::test_multiindex_doesnt_warn
>   
> /home/joris/miniconda3/envs/arrow-dev/lib/python3.10/site-packages/_pytest/python.py:192:
>  PytestRemovedIn8Warning: Passing None has been deprecated.
>   See 
> https://docs.pytest.org/en/latest/how-to/capture-warnings.html#additional-use-cases-of-warnings-in-tests
>  for alternatives in common use cases.
> result = testfunction(**testargs)
> {code}
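
For reference, a minimal sketch of the pytest-recommended replacement for {{pytest.warns(None)}} when the goal is "assert that no warnings are raised" (names are illustrative, not taken from the test suite):

{code:python}
import warnings

def do_something():
    # placeholder for the code under test
    return 1 + 1

def test_does_not_warn():
    # Instead of `with pytest.warns(None):`, turn any warning into an error
    # for the duration of the block, so an unexpected warning fails the test.
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        do_something()
{code}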



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18399) [Python] Reduce warnings during tests

2022-11-30 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18399.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14729
https://github.com/apache/arrow/pull/14729

> [Python] Reduce warnings during tests
> -
>
> Key: ARROW-18399
> URL: https://issues.apache.org/jira/browse/ARROW-18399
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Miles Granger
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Numerous warnings are displayed at the end of a test run; we should strive 
> to reduce them:
> https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18125) [Python] Handle pytest 8 deprecations about pytest.warns(None)

2022-11-30 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-18125:
-

Assignee: Miles Granger

> [Python] Handle pytest 8 deprecations about pytest.warns(None) 
> ---
>
> Key: ARROW-18125
> URL: https://issues.apache.org/jira/browse/ARROW-18125
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Miles Granger
>Priority: Major
> Fix For: 11.0.0
>
>
> We have a few warnings about that when running the tests, for example:
> {code}
> pyarrow/tests/test_pandas.py::TestConvertMetadata::test_rangeindex_doesnt_warn
> pyarrow/tests/test_pandas.py::TestConvertMetadata::test_multiindex_doesnt_warn
>   
> /home/joris/miniconda3/envs/arrow-dev/lib/python3.10/site-packages/_pytest/python.py:192:
>  PytestRemovedIn8Warning: Passing None has been deprecated.
>   See 
> https://docs.pytest.org/en/latest/how-to/capture-warnings.html#additional-use-cases-of-warnings-in-tests
>  for alternatives in common use cases.
> result = testfunction(**testargs)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18359) PrettyPrint Improvements

2022-11-29 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640819#comment-17640819
 ] 

Joris Van den Bossche edited comment on ARROW-18359 at 11/29/22 4:27 PM:
-

Also linking ARROW-14799, as that is another high-level potential change for 
Tables/RecordBatches


was (Author: jorisvandenbossche):
Also linking https://issues.apache.org/jira/browse/ARROW-14799, as that is 
another high-level potential change for Tables/RecordBatches

> PrettyPrint Improvements
> 
>
> Key: ARROW-18359
> URL: https://issues.apache.org/jira/browse/ARROW-18359
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python, R
>Reporter: Will Jones
>Priority: Major
>
> We have some pretty printing capabilities, but we may want to think at a high 
> level about the design first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18359) PrettyPrint Improvements

2022-11-29 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640819#comment-17640819
 ] 

Joris Van den Bossche commented on ARROW-18359:
---

Also linking https://issues.apache.org/jira/browse/ARROW-14799, as that is 
another high-level potential change for Tables/RecordBatches

> PrettyPrint Improvements
> 
>
> Key: ARROW-18359
> URL: https://issues.apache.org/jira/browse/ARROW-18359
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python, R
>Reporter: Will Jones
>Priority: Major
>
> We have some pretty printing capabilities, but we may want to think at a high 
> level about the design first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17326) [Go][FlightSQL] Add Support for FlightSQL to Go

2022-11-29 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17326:
--
Component/s: (was: SQL)

> [Go][FlightSQL] Add Support for FlightSQL to Go
> ---
>
> Key: ARROW-17326
> URL: https://issues.apache.org/jira/browse/ARROW-17326
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Also addresses https://github.com/apache/arrow/issues/12496



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17359) [Go][FlightSQL] Create SQLite example

2022-11-29 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17359:
--
Component/s: (was: SQL)

> [Go][FlightSQL] Create SQLite example
> -
>
> Key: ARROW-17359
> URL: https://issues.apache.org/jira/browse/ARROW-17359
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17325) AQE should use available column statistics from completed query stages

2022-11-29 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17325:
--
Component/s: Rust - Ballista
 (was: SQL)

> AQE should use available column statistics from completed query stages
> --
>
> Key: ARROW-17325
> URL: https://issues.apache.org/jira/browse/ARROW-17325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - Ballista
>Reporter: Andy Grove
>Priority: Major
>
> In QueryStageExec.computeStats we copy partial statistics from materialized 
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn 
> calls ShuffleExchangeLike#runtimeStatistics or 
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
>  {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     Some(Statistics(dataSize, numOutputRows, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> I would like to also copy over the column statistics stored in 
> Statistics.attributeMap so that they can be fed back into the logical plan 
> optimization phase. This is a small change as shown below:
> {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     val attributeStats = runtimeStats.attributeStats
>     Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
> not currently provide such column statistics, but other custom 
> implementations can.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18234) [Swift] Swift implementation of Arrow

2022-11-29 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18234:
--
Component/s: (was: Swift)

> [Swift] Swift implementation of Arrow
> -
>
> Key: ARROW-18234
> URL: https://issues.apache.org/jira/browse/ARROW-18234
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Alva Bandy
>Assignee: Alva Bandy
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Initial check-in for a swift implementation of Arrow. Based on my 
> understanding of the spec and looking through the C++ and C# current 
> implementations.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-2631) [Dart] Begin a Dart language library

2022-11-29 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-2631:
-
Component/s: (was: Dart)

> [Dart] Begin a Dart language library
> 
>
> Key: ARROW-2631
> URL: https://issues.apache.org/jira/browse/ARROW-2631
> Project: Apache Arrow
>  Issue Type: New Feature
> Environment: mobile
>Reporter: Gerard Webb
>Priority: Major
>  Labels: newbie
>
> as per here:
> [https://github.com/apache/arrow/issues/2066]
>  
> Dart now has FlatBuffers !! Woow.
> So lets put a basic example into Arrow to get the ball rolling
> Suggest a simple Flutter client consuming a dart and golang flatbuffers type 
> / kind.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18380) MIGRATION: Enable bot handling of GitHub issue linked PRs

2022-11-24 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638335#comment-17638335
 ] 

Joris Van den Bossche commented on ARROW-18380:
---

> Here is an example of GitHub bot comments that should be evaluated.

Should we just disable/remove that bot altogether? It only points people to open 
a JIRA issue, which we no longer want in the future. 

And we should maybe already disable that bot right now, since such a post will 
be confusing for a potential new contributor who then actually can't open a 
JIRA ...

cc [~assignUser]


> MIGRATION: Enable bot handling of GitHub issue linked PRs
> -
>
> Key: ARROW-18380
> URL: https://issues.apache.org/jira/browse/ARROW-18380
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Priority: Major
>
> GitHub workflows for the Apache Arrow project assume that PRs reference ASF 
> Jira issues (or are minor changes). This needs to be revisited now that 
> GitHub issue reporting is enabled, as there may well be no ASF Jira issue to 
> link a PR against going forward. The resulting bot comments can be confusing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18394) [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail

2022-11-24 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-18394:
-

Assignee: Joris Van den Bossche

> [CI][Python] Nightly python pandas jobs using latest or upstream_devel fail
> --
>
> Key: ARROW-18394
> URL: https://issues.apache.org/jira/browse/ARROW-18394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: Nightly
> Fix For: 11.0.0
>
>
> Currently the following jobs fail:
> |test-conda-python-3.8-pandas-nightly|https://github.com/ursacomputing/crossbow/actions/runs/3532562061/jobs/5927065343|
> |test-conda-python-3.9-pandas-upstream_devel|https://github.com/ursacomputing/crossbow/actions/runs/3532562477/jobs/5927066168|
> with:
> {code:java}
> _ test_roundtrip_with_bytes_unicode[columns0] _
> 
> columns = [b'foo']
> 
>     @pytest.mark.parametrize('columns', ([b'foo'], ['foo']))
>     def test_roundtrip_with_bytes_unicode(columns):
>         df = pd.DataFrame(columns=columns)
>         table1 = pa.Table.from_pandas(df)
> >       table2 = pa.Table.from_pandas(table1.to_pandas())
> 
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/test_pandas.py:2867: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> pyarrow/array.pxi:830: in pyarrow.lib._PandasConvertible.to_pandas
>     ???
> pyarrow/table.pxi:3908: in pyarrow.lib.Table._to_pandas
>     ???
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819:
>  in table_to_blockmanager
>     columns = _deserialize_column_index(table, all_columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:935:
>  in _deserialize_column_index
>     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1154:
>  in _reconstruct_columns_from_metadata
>     level = level.astype(dtype)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:1029:
>  in astype
>     return Index(new_values, name=self.name, dtype=new_values.dtype, 
> copy=False)
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:518:
>  in __new__
>     klass = cls._dtype_to_subclass(arr.dtype)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> cls = , dtype = dtype('S3')
> 
>     @final
>     @classmethod
>     def _dtype_to_subclass(cls, dtype: DtypeObj):
>         # Delay import for perf. 
> https://github.com/pandas-dev/pandas/pull/31423
>     
>         if isinstance(dtype, ExtensionDtype):
>             if isinstance(dtype, DatetimeTZDtype):
>                 from pandas import DatetimeIndex
>     
>                 return DatetimeIndex
>             elif isinstance(dtype, CategoricalDtype):
>                 from pandas import CategoricalIndex
>     
>                 return CategoricalIndex
>             elif isinstance(dtype, IntervalDtype):
>                 from pandas import IntervalIndex
>     
>                 return IntervalIndex
>             elif isinstance(dtype, PeriodDtype):
>                 from pandas import PeriodIndex
>     
>                 return PeriodIndex
>     
>             return Index
>     
>         if dtype.kind == "M":
>             from pandas import DatetimeIndex
>     
>             return DatetimeIndex
>     
>         elif dtype.kind == "m":
>             from pandas import TimedeltaIndex
>     
>             return TimedeltaIndex
>     
>         elif dtype.kind == "f":
>             from pandas.core.api import Float64Index
>     
>             return Float64Index
>         elif dtype.kind == "u":
>             from pandas.core.api import UInt64Index
>     
>             return UInt64Index
>         elif dtype.kind == "i":
>             from pandas.core.api import Int64Index
>     
>             return Int64Index
>     
>         elif dtype.kind == "O":
>             # NB: assuming away MultiIndex
>             return Index
>     
>         elif issubclass(
>             dtype.type, (str, bool, np.bool_, complex, np.complex64, 
> np.complex128)
>         ):
>             return Index
>     
> >       raise NotImplementedError(dtype)
> E       NotImplementedError: |S3
> 
> opt/conda/envs/arrow/lib/python3.8/site-packages/pandas/core/indexes/base.py:595: NotImplementedError
> {code}
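
A minimal repro extracted from the failing test above (a hedged sketch; the failure depends on the pandas nightly build being installed):

{code:python}
import pandas as pd
import pyarrow as pa

# A DataFrame whose column label is bytes rather than str
df = pd.DataFrame(columns=[b'foo'])
table1 = pa.Table.from_pandas(df)

# Round-tripping back through pandas fails on pandas nightly with
# NotImplementedError for the 'S3' dtype when reconstructing the column index.
table2 = pa.Table.from_pandas(table1.to_pandas())
{code}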



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18399) [Python] Reduce warnings during tests

2022-11-24 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638262#comment-17638262
 ] 

Joris Van den Bossche commented on ARROW-18399:
---

We have ARROW-17651 and ARROW-18125 already for specific subsets of those 
warnings (in case you might tackle them as separate PRs, you can use those / 
make them child tasks of this one, or otherwise close them as duplicate)

> [Python] Reduce warnings during tests
> -
>
> Key: ARROW-18399
> URL: https://issues.apache.org/jira/browse/ARROW-18399
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Miles Granger
>Priority: Minor
>
> Numerous warnings are displayed at the end of a test run; we should strive 
> to reduce them:
> https://github.com/apache/arrow/actions/runs/3533792571/jobs/5929880345#step:6:5489



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18373) MIGRATION: Enable multiple component selection in issue templates

2022-11-24 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18373.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14688
[https://github.com/apache/arrow/pull/14688]

> MIGRATION: Enable multiple component selection in issue templates
> -
>
> Key: ARROW-18373
> URL: https://issues.apache.org/jira/browse/ARROW-18373
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Todd Farmer
>Assignee: Todd Farmer
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Per comments in [this merged PR|https://github.com/apache/arrow/pull/14675], 
> we would like to enable selection of multiple components when reporting 
> issues via GitHub issues.
> Additionally, we may want to add the needed Apache license to the issue 
> templates and remove the exclusion rules from rat_exclude_files.txt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-11-23 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637679#comment-17637679
 ] 

Joris Van den Bossche edited comment on ARROW-18265 at 11/23/22 10:02 AM:
--

Yes, but the square-bracket form is basically the same as a single field path 
index; that is how it gets parsed (the top post could have been more explicit 
about this and mentioned not only FromDotPath but also the resulting 
FieldPath). 

Using a small workaround through StructFieldOptions to show this in python:

{code}
>>> pc.StructFieldOptions(".x[0]")
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) 
FieldRef.FieldPath(0)))
>>> pc.StructFieldOptions(["x", 0])
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) 
FieldRef.FieldPath(0)))
{code}

So those two give currently the same result. Essentially, at the moment a "[0]" 
element in a path gets interpreted as {{pc.field(0)}}.

Given the above, AFAIU, we either agree that {{field_ref(0)}} for a list type 
means selecting the first element of each list (and thus not for selecting the 
flat values), or we have to introduce a new concept in 
FieldRef/FieldPath to represent a list element selection (and thus also change 
how a "[0]" in a string path gets parsed to use this new object).


was (Author: jorisvandenbossche):
Yes, but the square-bracket form is basically the same as a single number; that 
is how it gets parsed (the top post could have been more explicit about this 
and mentioned not only FromDotPath but also the resulting FieldPath). 

Using a small workaround through StructFieldOptions to show this in python:

{code}
>>> pc.StructFieldOptions(".x[0]")
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) 
FieldRef.FieldPath(0)))
>>> pc.StructFieldOptions(["x", 0])
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) 
FieldRef.FieldPath(0)))
{code}

So those two give currently the same result. Essentially, at the moment a "[0]" 
element in a path gets interpreted as {{pc.field(0)}}.

Given the above, AFAIU, we either agree that {{field_ref(0)}} for a list type 
means selecting the first element of each list (and thus not for selecting the 
flat values), or we have to introduce a new concept in 
FieldRef/FieldPath to represent a list element selection (and thus also change 
how a "[0]" in a string path gets parsed to use this new object).

> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but this does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-11-23 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637679#comment-17637679
 ] 

Joris Van den Bossche commented on ARROW-18265:
---

Yes, but the square-bracket form is basically the same as a single number; that 
is how it gets parsed (the top post could have been more explicit about this 
and mentioned not only FromDotPath but also the resulting FieldPath). 

Using a small workaround through StructFieldOptions to show this in python:

{code}
>>> pc.StructFieldOptions(".x[0]")
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) 
FieldRef.FieldPath(0)))
>>> pc.StructFieldOptions(["x", 0])
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) 
FieldRef.FieldPath(0)))
{code}

So those two give currently the same result. Essentially, at the moment a "[0]" 
element in a path gets interpreted as {{pc.field(0)}}.

Given the above, AFAIU, we either agree that {{field_ref(0)}} for a list type 
means selecting the first element of each list (and thus not for selecting the 
flat values), or we have to introduce a new concept in 
FieldRef/FieldPath to represent a list element selection (and thus also change 
how a "[0]" in a string path gets parsed to use this new object).

> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but this does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-11-22 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637388#comment-17637388
 ] 

Joris Van den Bossche commented on ARROW-18265:
---

bq. I think you are referring to this:

Indeed. But in your code example above, {{pc.field(0) * 2}} is used to 
refer to the flat list values (the first child field), while the proposal in 
this issue is to have {{pc.field(0)}} mean selecting the first element of each 
list value (e.g. in projections). 
Those seem like quite different meanings, so I am not sure we should use the 
same API for both use cases.
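
To make the two meanings concrete in Python terms, a hedged sketch using existing pyarrow compute kernels (not the proposed FieldRef behaviour):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1, 2], [3, 4, 5]])

# "first element of each list" semantics (what this issue proposes for an
# integer step in a FieldPath applied to a list type):
pc.list_element(arr, 0)   # -> [1, 3]

# "flat child values" semantics (what field_ref(0) referred to in ARROW-17820):
pc.list_flatten(arr)      # -> [1, 2, 3, 4, 5]
{code}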



> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but this does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-11-22 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637292#comment-17637292
 ] 

Joris Van den Bossche edited comment on ARROW-18265 at 11/22/22 2:42 PM:
-

[~westonpace] one aspect to explicitly call out: are we OK with the fact that 
an integer index element in a FieldPath _means_ a "list element" selection? (or 
would you prefer to implement this differently?)

Because if we decide that an index means "list element selection" for 
ListTypes, that also means we can't use that to "select" the full (single) 
child field of a ListType. Mentioning this because you used the {{field_ref(0)}} / 
{{field_ref("item")}} example as one option in ARROW-17820 for how to reference the 
child field (for expressions where you want to apply an element-wise kernel on the 
child array of a ListArray). 

And I don't think we can use {{field_ref(0)}} for both things.



was (Author: jorisvandenbossche):
[~westonpace] one aspect to explicitly call out: are we OK with the fact that 
an integer index element in a FieldPath _means_ a "list element" selection? (or 
would you prefer to implement this differently?)

Because if we decide that an index means "list element selection" for 
ListTypes, that also means we can't use that to "select" the full (single) 
child field of a ListType. Mentioning this because you used the {{field_ref(0)}} / 
{{field_ref("item")}} example as one option in 
https://issues.apache.org/jira/browse/ARROW-17820 for how to reference the child 
field (for expressions where you want to apply an element-wise kernel on the child 
array of a ListArray).


> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but this does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18265) [C++] Allow FieldPath to work with ListElement

2022-11-22 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637292#comment-17637292
 ] 

Joris Van den Bossche commented on ARROW-18265:
---

[~westonpace] one aspect to explicitly call out: are we OK with the fact that 
an integer index element in a FieldPath _means_ a "list element" selection? (or 
would you prefer to implement this differently?)

Because if we decide that an index means "list element selection" for 
ListTypes, that also means we can't use that to "select" the full (single) 
child field of a ListType. Mentioning this because you used the {{field_ref(0)}} / 
{{field_ref("item")}} example as one option in 
https://issues.apache.org/jira/browse/ARROW-17820 for how to reference the child 
field (for expressions where you want to apply an element-wise kernel on the child 
array of a ListArray).


> [C++] Allow FieldPath to work with ListElement
> --
>
> Key: ARROW-18265
> URL: https://issues.apache.org/jira/browse/ARROW-18265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {{FieldRef::FromDotPath}} can parse a single list element field, i.e. 
> {{'path.to.list[0]'}}, but this does not work in practice, failing with:
> _struct_field: cannot subscript field of type list<>_
> Being able to add a slice or multiple list elements is not within the scope 
> of this issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18379) [Python] Change warnings to _warnings in _plasma_store_entry_point

2022-11-22 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18379.
---
Resolution: Fixed

Issue resolved by pull request 14695
[https://github.com/apache/arrow/pull/14695]

> [Python] Change warnings to _warnings in _plasma_store_entry_point
> --
>
> Key: ARROW-18379
> URL: https://issues.apache.org/jira/browse/ARROW-18379
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0, 10.0.2
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There is a leftover in {{python/pyarrow/__init__.py}} from 
> [https://github.com/apache/arrow/pull/14343] due to {{warnings}} being 
> imported as {{_warnings}}.
> Connected GitHub issue: [https://github.com/apache/arrow/issues/14693]
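
A hedged sketch of the kind of leftover being fixed (the warning text and surrounding code are assumed for illustration, not taken from the codebase):

{code:python}
import warnings as _warnings   # module-level import in pyarrow/__init__.py

def _plasma_store_entry_point():
    # The leftover code still referenced `warnings.warn(...)`, which fails
    # because only `_warnings` is imported; the fix uses `_warnings` instead.
    _warnings.warn(
        "plasma_store is deprecated and will be removed in a future release",
        DeprecationWarning, stacklevel=2)
{code}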



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17989) [C++] Enable struct_field kernel to accept string field names

2022-11-22 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-17989.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14495
[https://github.com/apache/arrow/pull/14495]

> [C++] Enable struct_field kernel to accept string field names
> -
>
> Key: ARROW-17989
> URL: https://issues.apache.org/jira/browse/ARROW-17989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Miles Granger
>Priority: Major
>  Labels: compute, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> Currently the "struct_field" kernel only works for integer indices for the 
> child fields. From the StructFieldOption class 
> (https://github.com/apache/arrow/blob/3d7f2f22a0fc441a41b8fa971e11c0f4290ebb24/cpp/src/arrow/compute/api_scalar.h#L283-L285):
> {code}
>   /// The child indices to extract. For instance, to get the 2nd child
>   /// of the 1st child of a struct or union, this would be {0, 1}.
>   std::vector<int> indices;
> {code}
> It would be nice if you could also refer to fields by name in addition to by 
> position.
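
A hedged usage sketch of what this enables (the exact accepted argument forms may differ between pyarrow versions):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([{"a": {"b": 1}}, {"a": {"b": 2}}])

# existing behaviour: select a nested child by integer indices
pc.struct_field(arr, indices=[0, 0])      # -> [1, 2]

# with this change: select by field names as well
pc.struct_field(arr, indices=["a", "b"])  # -> [1, 2]
{code}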



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18173) [Python] Drop older versions of Pandas (<1.0)

2022-11-22 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18173.
---
Resolution: Fixed

Issue resolved by pull request 14631
[https://github.com/apache/arrow/pull/14631]

> [Python] Drop older versions of Pandas (<1.0)
> -
>
> Key: ARROW-18173
> URL: https://issues.apache.org/jira/browse/ARROW-18173
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> We should drop older versions of pandas and support versions >= 1.0.
> Older versions are frequently causing issues on the CI.
> Version 1.0.0 was released on January 29, 2020.
> The changes will have to be done in:
>  * the official documentation (pandas version support)
>  * the CI jobs supporting older pandas versions
>  * 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/pandas-shim.pxi]
>  * tests that are specifically testing features on older versions of pandas



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18341) [Doc][Python] Update note about bundling Arrow C++ on Windows

2022-11-21 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18341.
---
Resolution: Fixed

Issue resolved by pull request 14660
[https://github.com/apache/arrow/pull/14660]

> [Doc][Python] Update note about bundling Arrow C++ on Windows
> -
>
> Key: ARROW-18341
> URL: https://issues.apache.org/jira/browse/ARROW-18341
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> There is a note on the Python development page, under the Windows section, about 
> bundling the Arrow C++ libraries with Python extensions:
> [https://arrow.apache.org/docs/dev/developers/python.html#building-on-windows]
> This note can be revised:
>  * If you are using conda, the fact that Arrow C++ libs are not bundled is 
> fine, since conda will ensure those libs are found.
>  * If you are not using conda, you have to ensure those libs can be found: 
> either by updating {{PATH}} (every time before importing pyarrow), or 
> by bundling them (... using the {{PYARROW_BUNDLE_ARROW_CPP}} env variable 
> instead of {{--bundle-arrow-cpp}}). With the caveat that those won't be 
> automatically updated when the Arrow C++ libs are rebuilt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18225) [Python] write_metadata does not fully use **kwargs

2022-11-21 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18225.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14574
[https://github.com/apache/arrow/pull/14574]

> [Python] write_metadata does not fully use **kwargs
> ---
>
> Key: ARROW-18225
> URL: https://issues.apache.org/jira/browse/ARROW-18225
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: François Chareyron
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> When using {{write_metadata}}, {{kwargs}} can be used to pass a FileSystem to 
> a ParquetWriter. However, those {{kwargs}} are not passed to 
> {{read_metadata}} later on despite the function accepting a filesystem 
> argument.
> This creates an error when trying to write metadata on an S3FileSystem, for 
> example.
> {code:python}
> def write_metadata(schema, where, metadata_collector=None, **kwargs):
>     writer = ParquetWriter(where, schema, **kwargs)
>     writer.close()
>     if metadata_collector is not None:
>         metadata = read_metadata(where)  # kwargs should be passed here
>         for m in metadata_collector:
>             metadata.append_row_groups(m)
>         metadata.write_metadata_file(where)  # kwargs should be passed here
> {code}
> {code:python}
> def read_metadata(where, memory_map=False, decryption_properties=None,
>   filesystem=None):
> ...{code}
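
A hedged usage sketch of the failure mode described above (the bucket path and region are hypothetical):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
schema = pa.schema([("x", pa.int64())])
collector = []

# The filesystem kwarg is forwarded to ParquetWriter, but the internal
# read_metadata() call is made without it, so it falls back to the local
# filesystem and fails for an S3 path.
pq.write_metadata(schema, "bucket/dataset/_metadata",
                  metadata_collector=collector, filesystem=s3)
{code}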



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635876#comment-17635876
 ] 

Joris Van den Bossche commented on ARROW-18363:
---

There is also some work to include this upstream in the sphinx theme: 
https://github.com/pydata/pydata-sphinx-theme/pull/780

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now we have versioned docs, we also have the old versions of the developers 
> docs (eg 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (eg regarding communication channels, build instructions, 
> etc), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example similarly how some projects warn about viewing old docs in 
> general and point to the stable docs (eg https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
> custom box when at a page in /developers to point to the dev docs instead of 
> stable docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635873#comment-17635873
 ] 

Joris Van den Bossche commented on ARROW-18363:
---

For example the MNE docs mentioned above do this with a piece of javascript: 
https://github.com/mne-tools/mne-tools.github.io/blob/main/versionwarning.js

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now we have versioned docs, we also have the old versions of the developers 
> docs (eg 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (eg regarding communication channels, build instructions, 
> etc), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example similarly how some projects warn about viewing old docs in 
> general and point to the stable docs (eg https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
> custom box when at a page in /developers to point to the dev docs instead of 
> stable docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635864#comment-17635864
 ] 

Joris Van den Bossche commented on ARROW-18363:
---

Renamed the issue so it is not specific to the contributing docs, since we can 
also do this for all docs. I think it would still be nice if we can 
special-case pages in the /developers section, so that for those pages we can 
1) point to the dev docs instead of the stable docs, and 2) also show this 
warning for the stable version.

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now we have versioned docs, we also have the old versions of the developers 
> docs (eg 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (eg regarding communication channels, build instructions, 
> etc), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example similarly how some projects warn about viewing old docs in 
> general and point to the stable docs (eg https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
> custom box when at a page in /developers to point to the dev docs instead of 
> stable docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18363) [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18363:
--
Summary: [Docs] Include warning when viewing old docs (redirecting to 
stable/dev docs)  (was: [Docs] Include warning when viewing old contributing 
docs (redirecting to dev docs))

> [Docs] Include warning when viewing old docs (redirecting to stable/dev docs)
> -
>
> Key: ARROW-18363
> URL: https://issues.apache.org/jira/browse/ARROW-18363
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Now we have versioned docs, we also have the old versions of the developers 
> docs (eg 
> https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
> might be outdated (eg regarding communication channels, build instructions, 
> etc), and typically when contributing / developing with the latest arrow, one 
> should _always_ check the latest dev version of the contributing docs.
> We could add a warning box pointing this out and linking to the dev docs. 
> For example similarly how some projects warn about viewing old docs in 
> general and point to the stable docs (eg https://mne.tools/1.1/index.html or 
> https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
> custom box when at a page in /developers to point to the dev docs instead of 
> stable docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18363) [Docs] Include warning when viewing old contributing docs (redirecting to dev docs)

2022-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18363:
-

 Summary: [Docs] Include warning when viewing old contributing docs 
(redirecting to dev docs)
 Key: ARROW-18363
 URL: https://issues.apache.org/jira/browse/ARROW-18363
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Joris Van den Bossche


Now we have versioned docs, we also have the old versions of the developers 
docs (eg 
https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
might be outdated (eg regarding communication channels, build instructions, 
etc), and typically when contributing / developing with the latest arrow, one 
should _always_ check the latest dev version of the contributing docs.

We could add a warning box pointing this out and linking to the dev docs. 

For example similarly how some projects warn about viewing old docs in general 
and point to the stable docs (eg https://mne.tools/1.1/index.html or 
https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
custom box when at a page in /developers to point to the dev docs instead of 
stable docs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18298) [Python] datetime shifted when using pyarrow.Table.from_pandas to load a pandas DateFrame containing datetime with timezone

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635819#comment-17635819
 ] 

Joris Van den Bossche commented on ARROW-18298:
---

bq. I thought initially it was just how it was presented, as going back to 
pandas in this example from the table gives the "correct" representation of the 
value:

Yes, in this case that is the cause of the confusion. The dates are not "wrong" 
after conversion to Arrow; they are just confusingly printed in UTC without any 
indication of this. We have ARROW-14567 to track this issue.

bq. However, placing mixed timezones makes the behavior more apparent in that 
it is coercing to the first timezone.

That's a separate issue (and something that doesn't happen that often; pandas, 
for example, also requires a single timezone for a column if you have a 
datetime64 dtype). But indeed, Arrow's timestamp type requires a single 
timezone, and thus when encountering multiple ones, we currently coerce to the 
first one. I think it would be better to coerce to UTC instead (-> ARROW-5912). 
There is some discussion about the use case of actually having multiple 
timezones in a single array at ARROW-16540.
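
A small sketch of that coercion behaviour, based on the description above (the exact timestamp unit in the resulting type may vary):

{code:python}
import pandas as pd
import pyarrow as pa

ts_la = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
ts_bx = pd.Timestamp("2022-10-21 22:46:17", tz="Europe/Brussels")

# object-dtype column holding timestamps with two different timezones
df = pd.DataFrame({"TS": [ts_la, ts_bx]})
table = pa.Table.from_pandas(df)

# The whole column is coerced to a single timezone (currently the first one
# encountered), e.g. timestamp[..., tz=America/Los_Angeles].
print(table.schema.field("TS").type)
{code}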



> [Python] datetime shifted when using pyarrow.Table.from_pandas to load a 
> pandas DateFrame containing datetime with timezone
> ---
>
> Key: ARROW-18298
> URL: https://issues.apache.org/jira/browse/ARROW-18298
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: MacOS M1, Python 3.8.13
>Reporter: Adam Ling
>Priority: Major
>
> Problem:
> When using pyarrow.Table.from_pandas to load a pandas DataFrame which 
> contains a timestamp object with timezone information, the created Table 
> object will shift the datetime, while still keeping the timezone information. 
> Please see my scripts.
>  
> Reproduce scripts:
> {code:java}
> import pandas as pd
> import pyarrow
> ts = pd.Timestamp("2022-10-21 22:46:17", tz="America/Los_Angeles")
> df = pd.DataFrame({"TS": [ts]})
> table = pyarrow.Table.from_pandas(df)
> print(df)
> """
>  TS
> 0 2022-10-21 22:46:17-07:00
> """
> print(table)
> """
> pyarrow.Table
> TS: timestamp[ns, tz=America/Los_Angeles]
> 
> TS: [[2022-10-22 05:46:17.0]]"""
> {code}
> Expected results:
> The table should not shift the datetime when timezone information is provided.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17136) [C++] HadoopFileSystem open_append_stream throwing an error if file does not exists

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Summary: [C++] HadoopFileSystem open_append_stream throwing an error if 
file does not exists  (was: [C++] open_append_stream throwing an error if file 
does not exists)

> [C++] HadoopFileSystem open_append_stream throwing an error if file does not 
> exists
> ---
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>
> As per the documentation, the open_append_stream method will create the file 
> if it does not exist. But when I try to append to a file in HDFS it throws a 
> file-not-found error.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(Fil

[jira] [Updated] (ARROW-17136) [C++] HadoopFileSystem open_append_stream throwing an error if file does not exist

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Labels: good-first-issue  (was: )

> [C++] HadoopFileSystem open_append_stream throwing an error if file does not 
> exist
> ---
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>  Labels: good-first-issue
>
> As per the documentation, open_append_stream will create the file if it does 
> not exist. But when I try to append to a file in HDFS it throws an error 
> saying the file is not found.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append

[jira] [Updated] (ARROW-17136) [C++] open_append_stream throwing an error if file does not exist

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Component/s: C++
 (was: Python)

> [C++] open_append_stream throwing an error if file does not exist
> --
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>
> As per the documentation, open_append_stream will create the file if it does 
> not exist. But when I try to append to a file in HDFS it throws an error 
> saying the file is not found.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:431)
>         at 
> org.apache

[jira] [Updated] (ARROW-17136) [C++] open_append_stream throwing an error if file does not exist

2022-11-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17136:
--
Summary: [C++] open_append_stream throwing an error if file does not exist 
 (was: open_append_stream throwing an error if file does not exist)

> [C++] open_append_stream throwing an error if file does not exist
> --
>
> Key: ARROW-17136
> URL: https://issues.apache.org/jira/browse/ARROW-17136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Sagar Shinde
>Priority: Minor
>
> As per the documentation, open_append_stream will create the file if it does 
> not exist. But when I try to append to a file in HDFS it throws an error 
> saying the file is not found.
> hdfsOpenFile(/tmp/xyz.json): 
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;)
>  error:
> RemoteException: Failed to append to non-existent file /tmp/xyz.json for 
> client
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> java.io.FileNotFoundException: Failed to append to non-existent file 
> /tmp/xyz.json for client x.x.x.x
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:104)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2639)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:805)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:487)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1367)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1424)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1394)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:423)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$5.doCall(DistributedFileSystem.java:419)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.ha

[jira] [Comment Edited] (ARROW-18276) [Python] Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] Opening HDFS file

2022-11-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635303#comment-17635303
 ] 

Joris Van den Bossche edited comment on ARROW-18276 at 11/18/22 9:43 AM:
-

Hi [~moritzmeister] !

Could you try using {{pyarrow}} directly to see if you then get the same error 
when opening the file? 
You can instantiate a {{HadoopFileSystem}} object [from a URI 
string|https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html#pyarrow.fs.HadoopFileSystem.from_uri],
or use the class constructor directly 
(https://arrow.apache.org/docs/dev/python/filesystems.html#hadoop-distributed-file-system-hdfs).
 Something similar to this:

{code}
from pyarrow import fs
hdfs, _ = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
hdfs.open_input_file("/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
{code}

If that works, you can then use {{hdfs}} with {{{}fsspec{}}}:
[https://arrow.apache.org/docs/python/filesystems.html#using-arrow-filesystems-with-fsspec]

and {{fsspec}} API to open the files:
[https://filesystem-spec.readthedocs.io/en/latest/api.html]

Something similar to this:
{code:python}
from pyarrow import fs
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
from fsspec.implementations.arrow import ArrowFSWrapper
hdfs_fsspec = ArrowFSWrapper(hdfs)
hdfs_fsspec.open_files(...)
{code}
This way you can see if pyarrow 10.0.0 works or errors. And it is more direct 
so less likely to error :)
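
For reference, a minimal sketch (assuming the namenode host/port and CSV path 
from your traceback are reachable) of how the fsspec wrapper can then be handed 
to pandas:

{code:python}
import pandas as pd
from pyarrow import fs
from fsspec.implementations.arrow import ArrowFSWrapper

# Connect with the explicit constructor (host and port taken from the URI above).
hdfs = fs.HadoopFileSystem("10.0.2.15", 8020)
hdfs_fsspec = ArrowFSWrapper(hdfs)

path = ("/Projects/testing/testing_Training_Datasets/"
        "transactions_view_fraud_batch_fv_1_1/validation/"
        "part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")

# Open through fsspec and let pandas parse the CSV from the file handle.
with hdfs_fsspec.open(path, "rb") as f:
    df = pd.read_csv(f)
{code}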

Also, do you maybe know if the Hadoop installation has changed in the meantime?


was (Author: alenkaf):
Hi [~moritzmeister] !

Could you try using {{pyarrow}} directly?
You can instantiate {{HadoopFileSystem}} object [from an URI 
string|https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html#pyarrow.fs.HadoopFileSystem.from_uri].

If that works, you can then use {{hdfs}} with {{{}fsspec{}}}:
[https://arrow.apache.org/docs/python/filesystems.html#using-arrow-filesystems-with-fsspec]

and {{fsspec}} API to open the files:
[https://filesystem-spec.readthedocs.io/en/latest/api.html]

Something similar to this:
{code:python}
from pyarrow import fs
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
from fsspec.implementations.arrow import ArrowFSWrapper
hdfs_fsspec = ArrowFSWrapper(hdfs)
hdfs_fsspec.open_files(...)
{code}
This way you can see if pyarrow 10.0.0 works or errors. And it is more direct 
so less likely to error :)

Also, do you maybe know if the Hadoop installation has changed in this time?

> [Python] Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] 
> Opening HDFS file
> 
>
> Key: ARROW-18276
> URL: https://issues.apache.org/jira/browse/ARROW-18276
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
> Environment: pyarrow 10.0.0
> fsspec 2022.7.1
> pandas 1.3.3
> python 3.8.11.
>Reporter: Moritz Meister
>Priority: Major
>
> Hey!
> I am trying to read a CSV file using pyarrow together with fsspec from HDFS.
> I used to do this with pyarrow 9.0.0 and fsspec 2022.7.1, however, after I 
> upgraded to pyarrow 10.0.0 this stopped working.
> I am not quite sure if this is an incompatibility introduced in the new 
> pyarrow version or if it is a Bug in fsspec. So if I am in the wrong place 
> here, please let me know.
> Apart from pyarrow 10.0.0 and fsspec 2022.7.1, I am using pandas version 
> 1.3.3 and python 3.8.11.
> Here is the full stack trace
> {code:python}
> pd.read_csv("hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
> ---
> OSError                                   Traceback (most recent call last)
> /srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py
>  in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, 
> usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, 
> true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, 
> na_values, keep_default_na, na_filter, verbose, skip_blank_lines, 
> parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, 
> cache_dates, iterator, chunksize, comp

[jira] [Commented] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow

2022-11-16 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634814#comment-17634814
 ] 

Joris Van den Bossche commented on ARROW-18340:
---

cc [~kou] [~raulcd]

> [Python] PyArrow C++ header files no longer always included in installed 
> pyarrow
> 
>
> Key: ARROW-18340
> URL: https://issues.apache.org/jira/browse/ARROW-18340
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 10.0.0
>Reporter: Joris Van den Bossche
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.1, 11.0.0
>
>
> We have a python build env var to control whether the Arrow C++ header files 
> are included in the python package or not 
> ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and 
> only in the conda recipe set to False.
> After the cmake refactor, the Python C++ header files no longer live in the 
> Arrow C++ package, and so should _always_ be included in the python package, 
> regardless of how arrow-cpp is installed. 
> Initially this was done, but it seems that 
> https://github.com/apache/arrow/pull/13892 removed this unconditional copy of 
> the PyArrow header files to {{pyarrow/include}}. Now it is only copied if 
> {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow

2022-11-16 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18340:
--
Affects Version/s: 10.0.0

> [Python] PyArrow C++ header files no longer always included in installed 
> pyarrow
> 
>
> Key: ARROW-18340
> URL: https://issues.apache.org/jira/browse/ARROW-18340
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 10.0.0
>Reporter: Joris Van den Bossche
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.1, 11.0.0
>
>
> We have a python build env var to control whether the Arrow C++ header files 
> are included in the python package or not 
> ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and 
> only in the conda recipe set to False.
> After the cmake refactor, the Python C++ header files no longer live in the 
> Arrow C++ package, and so should _always_ be included in the python package, 
> regardless of how arrow-cpp is installed. 
> Initially this was done, but it seems that 
> https://github.com/apache/arrow/pull/13892 removed this unconditional copy of 
> the PyArrow header files to {{pyarrow/include}}. Now it is only copied if 
> {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow

2022-11-16 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18340:
--
Component/s: Python

> [Python] PyArrow C++ header files no longer always included in installed 
> pyarrow
> 
>
> Key: ARROW-18340
> URL: https://issues.apache.org/jira/browse/ARROW-18340
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.1, 11.0.0
>
>
> We have a python build env var to control whether the Arrow C++ header files 
> are included in the python package or not 
> ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and 
> only in the conda recipe set to False.
> After the cmake refactor, the Python C++ header files no longer live in the 
> Arrow C++ package, and so should _always_ be included in the python package, 
> regardless of how arrow-cpp is installed. 
> Initially this was done, but it seems that 
> https://github.com/apache/arrow/pull/13892 removed this unconditional copy of 
> the PyArrow header files to {{pyarrow/include}}. Now it is only copied if 
> {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow

2022-11-16 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18340:
--
Fix Version/s: 11.0.0

> [Python] PyArrow C++ header files no longer always included in installed 
> pyarrow
> 
>
> Key: ARROW-18340
> URL: https://issues.apache.org/jira/browse/ARROW-18340
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 10.0.1, 11.0.0
>
>
> We have a python build env var to control whether the Arrow C++ header files 
> are included in the python package or not 
> ({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and 
> only in the conda recipe set to False.
> After the cmake refactor, the Python C++ header files no longer live in the 
> Arrow C++ package, and so should _always_ be included in the python package, 
> regardless of how arrow-cpp is installed. 
> Initially this was done, but it seems that 
> https://github.com/apache/arrow/pull/13892 removed this unconditional copy of 
> the PyArrow header files to {{pyarrow/include}}. Now it is only copied if 
> {{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18340) [Python] PyArrow C++ header files no longer always included in installed pyarrow

2022-11-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18340:
-

 Summary: [Python] PyArrow C++ header files no longer always 
included in installed pyarrow
 Key: ARROW-18340
 URL: https://issues.apache.org/jira/browse/ARROW-18340
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche
Assignee: Alenka Frim
 Fix For: 10.0.1


We have a python build env var to control whether the Arrow C++ header files 
are included in the python package or not 
({{PYARROW_BUNDLE_ARROW_CPP_HEADERS}}). This is set to True by default, and 
only in the conda recipe set to False.

After the cmake refactor, the Python C++ header files no longer live in the 
Arrow C++ package, and so should _always_ be included in the python package, 
regardless of how arrow-cpp is installed. 
Initially this was done, but it seems that 
https://github.com/apache/arrow/pull/13892 removed this unconditional copy of 
the PyArrow header files to {{pyarrow/include}}. Now it is only copied if 
{{PYARROW_BUNDLE_ARROW_CPP_HEADERS}} is enabled.
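
As a quick way to see the effect described above, a small sketch (the exact 
header layout under the include directory is an assumption and may differ per 
version):

{code:python}
import os
import pyarrow as pa

# Directory that should contain the bundled headers when bundling is enabled.
include_dir = pa.get_include()

# Illustrative check for one of the PyArrow C++ headers.
header = os.path.join(include_dir, "arrow", "python", "pyarrow.h")
print(include_dir)
print("PyArrow C++ headers bundled:", os.path.exists(header))
{code}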




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18129) [Python] get_include() gives wrong directory in conda environment

2022-11-16 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18129:
--
Summary: [Python] get_include() gives wrong directory in conda environment  
(was: get_include() gives wrong directory)

> [Python] get_include() gives wrong directory in conda environment
> -
>
> Key: ARROW-18129
> URL: https://issues.apache.org/jira/browse/ARROW-18129
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
> Environment: conda
>Reporter: Left Screen
>Priority: Critical
>  Labels: triaged
>
> {{get_include}} seems to do:
>  
> {code:java}
> def get_include():
>     """
>     Return absolute path to directory containing Arrow C++ include
>     headers. Similar to numpy.get_include
>     """
>     return _os.path.join(_os.path.dirname(__file__), 'include') {code}
> This returns something like:
> {code:java}
> /path/to/myconda/envs/envname/lib/python3.8/site-packages/pyarrow/include{code}
> which does not exist in a conda environment. The path where the headers 
> actually get installed is:
>  
> {code:java}
> $ echo $CONDA_PREFIX
> /path/to/myconda/envs/envname
> $ ls $CONDA_PREFIX/include/arrow | head
> adapters
> api.h
> array
> array.h
> buffer_builder.h
> buffer.h
> builder.h
> c
> chunked_array.h
> chunk_resolver.h
> {code}
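
A rough sketch of the mismatch described above (assuming a conda environment; 
the paths are illustrative):

{code:python}
import os
import pyarrow as pa

# What pyarrow reports...
reported = pa.get_include()
print(reported, os.path.isdir(reported))

# ...versus where conda actually installs the Arrow headers.
conda_include = os.path.join(os.environ.get("CONDA_PREFIX", ""), "include")
print(conda_include, os.path.isdir(os.path.join(conda_include, "arrow")))
{code}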



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-11-15 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634264#comment-17634264
 ] 

Joris Van den Bossche commented on ARROW-15716:
---

To just OR-combine the different expressions for each of the paths, you can do 
this automatically with {{reduce()}} and a list comprehension calling 
Partitioning.parse on each of the paths (without having to resort to 
{{_get_partition_keys}} and {{filters_to_expression}}). Using your example:

{code}
paths = ['path/to/data/month_id=202105/v1-manual__2022-11-06T22:50:20.parquet',
 'path/to/data/month_id=202106/v1-manual__2022-11-06T22:50:20.parquet',
 'path/to/data/month_id=202107/v1-manual__2022-11-06T22:50:20..parquet']
partitioning = ds.partitioning(pa.schema([('month_id', 'int64')]), 
flavor="hive")


>>> import operator
>>> import functools
>>> functools.reduce(operator.or_, [partitioning.parse(file) for file in paths])

{code}

I think this is what Weston is suggesting to do. It doesn't necessarily give 
the most efficient filter expression, but it is a direct translation of the 
subset of paths (if there are many paths, it might be more efficient to use 
{{isin}} or a greater/smaller comparison kernel).
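
A sketch of that {{isin}} variant (assuming the same {{month_id}} hive 
partitioning as in the example above):

{code:python}
import pyarrow.dataset as ds

# Partition values extracted from the example paths.
month_ids = [202105, 202106, 202107]

# One membership test instead of a chain of OR-ed equality expressions.
filter_expr = ds.field("month_id").isin(month_ids)
# dataset.to_table(filter=filter_expr) would then prune to those partitions.
{code}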

> [Dataset][Python] Parse a list of fragment paths to gather filters
> --
>
> Key: ARROW-15716
> URL: https://issues.apache.org/jira/browse/ARROW-15716
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Lance Dacey
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18264) [Python] Add Time64Scalar.value field

2022-11-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18264.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14637
[https://github.com/apache/arrow/pull/14637]

> [Python] Add Time64Scalar.value field
> -
>
> Key: ARROW-18264
> URL: https://issues.apache.org/jira/browse/ARROW-18264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 10.0.0
> Environment: pyarrow==10.0.0
> No pandas installed
>Reporter: &res
>Assignee: &res
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> At the moment, when pandas is not installed, it is not possible to access the 
> underlying value for a Time64Scalar of "ns" precision without casting it to 
> int64.
> {code:java}
> time_ns = pa.array([1, 2, 3],pa.time64("ns"))
> scalar = time_ns[0]
> scalar.as_py() {code}
> Raises:
> {code:java}
> ValueError: Nanosecond resolution temporal type 1 is not safely convertible 
> to microseconds to convert to datetime.datetime. Install pandas to return as 
> Timestamp with nanosecond support or access the .value attribute{code}
> But value isn't available:
> {code:java}
> scalar.value {code}
> Raises:
> {code:java}
> AttributeError: 'pyarrow.lib.Time64Scalar' object has no attribute 'value' 
> {code}
> The workaround is to do:
> {code:java}
> scalar.cast(pa.int64()).as_py() {code}
> It'd be good if a value field was added to Time64Scalar, just like the 
> TimestampScalar
> {code:java}
> timestamp_ns = pa.array([1, 2, 3],pa.timestamp("ns", "UTC"))
> scalar = timestamp_ns[0]
> scalar.value {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18329) [Python][CI] Support ORC in Windows wheels

2022-11-15 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18329:
-

 Summary: [Python][CI] Support ORC in Windows wheels
 Key: ARROW-18329
 URL: https://issues.apache.org/jira/browse/ARROW-18329
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Now that we support building with ORC enabled on Windows (ARROW-17817), we could 
also enable it in the Python wheel packages for Windows (vcpkg seems to have an 
ORC port for Windows as well).
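
Once the wheels are built, a small sketch of how one could verify that ORC 
support made it into a given installation:

{code:python}
import pyarrow as pa

try:
    # The submodule import fails when the build does not include ORC.
    import pyarrow.orc  # noqa: F401
    print("ORC is available in pyarrow", pa.__version__)
except ImportError as exc:
    print("ORC is not available:", exc)
{code}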



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18257) [Python] array of time64 type changes from Time64Type to DataType

2022-11-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18257.
---
Resolution: Fixed

Issue resolved by pull request 14633
[https://github.com/apache/arrow/pull/14633]

> [Python] array of time64 type changes from Time64Type to DataType
> -
>
> Key: ARROW-18257
> URL: https://issues.apache.org/jira/browse/ARROW-18257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
> Environment: python 3.9
> pyarrow 10.0.0
> No pandas installed
>Reporter: &res
>Assignee: &res
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When creating an array of time64 elements, the array type information is 
> changed from Time64Type to DataType. 
> While it's not an issue as such, given it still looks like an array of 
> time64, I can't access special attributes of the Time64Type (for example unit)
>  
> {code:java}
> dtype = pa.time64("ns")
> time_array = pa.array(
> [
> 1,
> 2,
> 3
> ],
> dtype
> )
> assert pa.types.is_time64(time_array.type) is True
> assert isinstance(dtype, pa.Time64Type) is True
> assert isinstance(time_array.type, pa.Time64Type) is False # Wrong
> assert isinstance(time_array.type, pa.DataType) is True # Wrong
> assert dtype == time_array.type
> assert dtype.unit == "ns"
> with pytest.raises(AttributeError, match=r"'pyarrow.lib.DataType' object has no attribute 'unit'"):
>     # Should be able to access unit:
>     time_array.type.unit{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-9538) [Python] Allow pyarrow.filesystem.resolve_filesystem_and_path to parse S3 URL

2022-11-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-9538.

Resolution: Won't Fix

> [Python] Allow pyarrow.filesystem.resolve_filesystem_and_path to parse S3 URL
> -
>
> Key: ARROW-9538
> URL: https://issues.apache.org/jira/browse/ARROW-9538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: Adam Avilla
>Priority: Minor
>  Labels: filesystem
>
> {{pyarrow.filesystem.resolve_filesystem_and_path}} should support a {{where}} 
> that is a S3 URL like:
> {code:java}
> s3://bucket/folder/file.ext{code}
> It seems like all the pieces are there but this was never developed. If given 
> some light guidance I may be able to add the code in a PR.
> Thanks and LMK if this is a crazy request!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-9538) [Python] Allow pyarrow.filesystem.resolve_filesystem_and_path to parse S3 URL

2022-11-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631653#comment-17631653
 ] 

Joris Van den Bossche commented on ARROW-9538:
--

This works for {{pyarrow.fs}}, so it should also work for all methods 
like {{pq.read_table(..)}} or {{ds.dataset(..)}} that use it under the hood; 
those should therefore accept an S3 URI.

{code}
In [1]: import pyarrow.fs

In [2]: pyarrow.fs._resolve_filesystem_and_path("s3://bucket/folder/file.ext")
Out[2]: (<pyarrow._s3fs.S3FileSystem object at 0x...>, 'bucket/folder/file.ext')
{code}
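
For example, a minimal sketch (the bucket and key are just placeholders from 
this issue, and valid AWS credentials/region resolution are assumed):

{code:python}
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Both resolve the S3 filesystem from the URI under the hood.
table = pq.read_table("s3://bucket/folder/file.parquet")
dataset = ds.dataset("s3://bucket/folder/", format="parquet")
{code}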

Since the pyarrow.filesystem module is deprecated and no longer being 
developed, I am going to close this issue.

> [Python] Allow pyarrow.filesystem.resolve_filesystem_and_path to parse S3 URL
> -
>
> Key: ARROW-9538
> URL: https://issues.apache.org/jira/browse/ARROW-9538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: Adam Avilla
>Priority: Minor
>  Labels: filesystem
>
> {{pyarrow.filesystem.resolve_filesystem_and_path}} should support a {{where}} 
> that is a S3 URL like:
> {code:java}
> s3://bucket/folder/file.ext{code}
> It seems like all the pieces are there but this was never developed. If given 
> some light guidance I may be able to add the code in a PR.
> Thanks and LMK if this is a crazy request!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18297) [Python] from/to pandas with MultiIndex raises incorrectly

2022-11-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18297:
--
Summary: [Python] from/to pandas with MultiIndex raises incorrectly  (was: 
from/to pandas with MultiIndex raises incorrectly)

> [Python] from/to pandas with MultiIndex raises incorrectly
> --
>
> Key: ARROW-18297
> URL: https://issues.apache.org/jira/browse/ARROW-18297
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Shoham Debnath
>Priority: Major
>
> The error is only raised when one index is a RangeIndex and the other isn't.
> {code:java}
> df = pd.DataFrame({"a":[1,2], "b":[3,4]})
> df = df.set_index(["a"], append=True)
> pa.Table.from_pandas(df).to_pandas()
> Traceback (most recent call last):
>   File 
> "/Users/debnathshoham/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/core/interactiveshell.py",
>  line 3378, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 1, in 
>     pa.Table.from_pandas(df).to_pandas()
>   File "pyarrow/array.pxi", line 823, in 
> pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
>   File 
> "/Users/debnathshoham/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py",
>  line 808, in table_to_blockmanager
>     table, index = _reconstruct_index(table, index_descriptors,
>   File 
> "/Users/debnathshoham/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py",
>  line 959, in _reconstruct_index
>     result_table, index_level, index_name = _extract_index_level(
>   File 
> "/Users/debnathshoham/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py",
>  line 997, in _extract_index_level
>     logical_name = field_name_to_metadata[field_name]['name']
> KeyError: 'a' {code}
>  
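
A possible workaround sketch (not from the report): materialize the index into 
regular columns before the roundtrip, so the partial RangeIndex metadata is not 
involved:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df = df.set_index(["a"], append=True)

# reset_index() turns the MultiIndex levels into columns, which roundtrips fine.
roundtripped = pa.Table.from_pandas(df.reset_index()).to_pandas()
print(roundtripped)
{code}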



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18164) [Python] Dataset scanner does not follow default memory pool setting

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18164.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14516
[https://github.com/apache/arrow/pull/14516]

> [Python] Dataset scanner does not follow default memory pool setting
> 
>
> Key: ARROW-18164
> URL: https://issues.apache.org/jira/browse/ARROW-18164
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Even if I set the system memory pool as default, it still uses the jemalloc 
> one (running this on Ubuntu where jemalloc is the default if not set by the 
> user):
> {code}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> pq.write_table(pa.table({'a': [1, 2, 3]}), "test.parquet")
> In [2]: pa.set_memory_pool(pa.system_memory_pool())
> In [3]: pa.total_allocated_bytes()
> Out[3]: 0
> In [4]: table = ds.dataset("test.parquet").to_table()
> In [5]: pa.total_allocated_bytes()
> Out[5]: 0
> In [6]: pa.set_memory_pool(pa.jemalloc_memory_pool())
> In [7]: pa.total_allocated_bytes()
> Out[7]: 128
> {code}
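
A sketch of a possible workaround (assuming the explicit {{memory_pool}} 
keyword is honoured, unlike the process-wide default):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Route the scan's allocations through the system pool explicitly.
pool = pa.system_memory_pool()
table = ds.dataset("test.parquet").to_table(memory_pool=pool)
print(pool.bytes_allocated())
{code}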



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18229) [C++][Python] RecordBatchReader can be created with a 'dict' schema which then crashes on use

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18229.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14583
[https://github.com/apache/arrow/pull/14583]

> [C++][Python] RecordBatchReader can be created with a 'dict' schema which 
> then crashes on use
> -
>
> Key: ARROW-18229
> URL: https://issues.apache.org/jira/browse/ARROW-18229
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: David Li
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available, triaged
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Presumably we should disallow this or convert it to a schema?
> https://github.com/duckdb/duckdb/issues/5143
> {noformat}
> >>> import pyarrow as pa
> >>> pa.__version__
> '10.0.0'
> >>> reader = pa.RecordBatchReader.from_batches({"a": pa.int8()}, [])
> >>> reader.schema
> fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
> (gdb) bt
> #0  0x74247580 in arrow::Schema::num_fields() const ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #1  0x742b93f7 in arrow::(anonymous namespace)::SchemaPrinter::Print()
> ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #2  0x742b98a7 in arrow::PrettyPrint(arrow::Schema const&, 
> arrow::PrettyPrintOptions const&, std::string*) ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #3  0x764f814b in 
> __pyx_pw_7pyarrow_3lib_6Schema_52to_string(_object*, _object*, _object*) ()
> {noformat}
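
For comparison, a sketch of the non-crashing variant that passes a real 
{{Schema}} ({{pa.schema}} accepts a dict of field name to type):

{code:python}
import pyarrow as pa

# Build an actual Schema instead of handing a plain dict to from_batches.
schema = pa.schema({"a": pa.int8()})
reader = pa.RecordBatchReader.from_batches(schema, [])
print(reader.schema)
{code}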



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17893) [Python] Bug: Wrong reading of timedelta

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-17893.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14531
[https://github.com/apache/arrow/pull/14531]

> [Python] Bug: Wrong reading of timedelta
> 
>
> Key: ARROW-17893
> URL: https://issues.apache.org/jira/browse/ARROW-17893
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
> Environment: macOS 12.6 on an Apple M1 Ultra
>Reporter: Yaser Alraddadi
>Assignee: Alenka Frim
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
> Attachments: check_timedelta.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When there is a timedelta column alongside a list-of-dict column that also 
> contains timedeltas, reading the top-level timedelta back from the Feather 
> format sometimes gives a wrong value.
> Below is an example; if you check the printed results, it sometimes reads the 
> top-level timedelta as {color:#00875a}0 days 03:40:23 (correct){color}, and 
> sometimes as {color:#de350b}153 days 01:03:20 (wrong){color}.
> Here is the code, also it is attached as check_timedelta.py
>  
> {code:java}
> from datetime import datetime, timedelta
> import pandas as pd
> import pyarrow.feather as feather
> time_1 = datetime.fromisoformat("2022-04-21T10:18:12+03:00")
> time_2 = datetime.fromisoformat("2022-04-21T13:58:35+03:00")
> data = [
>     {
>         "waiting_time": timedelta(seconds=12, microseconds=1),
>     },
>     {
>         "waiting_time": timedelta(seconds=1020),
>     },
>     {
>         "waiting_time": timedelta(seconds=960),
>     },
>     {
>         "waiting_time": timedelta(seconds=960),
>     },
>     {
>         "waiting_time": timedelta(seconds=960),
>     },
>     {
>         "waiting_time": timedelta(seconds=815, microseconds=1),
>     },
> ]
> df = pd.DataFrame(
>     [
>         {
>             "time_1": time_1,
>             "time_2": time_2,
>             "data": data,
>             "timedelta_1": time_2 - time_1,
>             "timedelta_2": timedelta(hours=3, minutes=40, seconds=23),
>         },
>     ]
> )
> print("Correct timedelta_1: ", df["timedelta_1"].item())
> print("Correct timedelta_2: ", df["timedelta_2"].item())
> with open(f"records.feather.lz4", "wb") as f:
>     feather.write_feather(df, f, compression="lz4")
> for _ in range(10):
>     with open(f"records.feather.lz4", "rb") as f:
>         print("Reading timedelta_1: ", feather.read_feather(f)["timedelta_1"].item())
>         print("Reading timedelta_2: ", feather.read_feather(f)["timedelta_2"].item())
> {code}
>  
>  
> Printed Results
>  
> {code:java}
> Correct timedelta_1:  0 days 03:40:23
> Correct timedelta_2:  0 days 03:40:23
> Reading timedelta_1:  0 days 03:40:23
> Reading timedelta_2:  0 days 03:40:23
> Reading timedelta_1:  0 days 03:40:23
> Reading timedelta_2:  0 days 03:40:23
> Reading timedelta_1:  153 days 01:03:20
> Reading timedelta_2:  153 days 01:03:20
> Reading timedelta_1:  0 days 03:40:23
> Reading timedelta_2:  0 days 03:40:23
> Reading timedelta_1:  0 days 03:40:23
> Reading timedelta_2:  0 days 03:40:23
> Reading timedelta_1:  0 days 03:40:23
> Reading timedelta_2:  153 days 01:03:20
> Reading timedelta_1:  153 days 01:03:20
> Reading timedelta_2:  0 days 03:40:23
> Reading timedelta_1:  0 days 03:40:23
> Reading timedelta_2:  153 days 01:03:20
> Reading timedelta_1:  153 days 01:03:20
> Reading timedelta_2:  153 days 01:03:20
> Reading timedelta_1:  153 days 01:03:20
> Reading timedelta_2:  153 days 01:03:20{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18295) [C++] FieldRef::FindAll/FindOne(DataType) improve error

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18295:
--
Summary: [C++] FieldRef::FindAll/FindOne(DataType) improve error  (was: 
[C++] FieldRef::FineAll/FindOne(DataType) improve error)

> [C++] FieldRef::FindAll/FindOne(DataType) improve error
> ---
>
> Key: ARROW-18295
> URL: https://issues.apache.org/jira/browse/ARROW-18295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
> Fix For: 11.0.0
>
>
> [GitHub PR 14495 | https://github.com/apache/arrow/pull/14495] adds support 
> for {{struct_field}} to accept string field names (as well as a mix of 
> indices/strings). A side effect is that the error produced by 
> {{FieldRef::FindOne}} (and by proxy {{FieldRef::FindAll}}) is not as good as 
> the one the previous {{StructFieldFunctor::CheckIndex}} gave. 
> See the [GitHub discussion here for more context | 
> https://github.com/apache/arrow/pull/14495#discussion_r1016325430]
> It would be good to have a similar error message given when using 
> {{FieldRef::FindOne}} on a {{DataType}}.
> Example error from {{StructFieldFunctor::CheckIndex}}:
> _out-of-bounds field reference to field 4 in type struct c: struct> with 3 fields_
> Error from {{FieldRef::FindOne}}:
> _No match for FieldRef.FieldPath(2 4) in struct struct>_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18238) [Python] Improve docs for S3FileSystem / bucket region resolution

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18238.
---
Resolution: Fixed

Issue resolved by pull request 14599
[https://github.com/apache/arrow/pull/14599]

> [Python] Improve docs for S3FileSystem / bucket region resolution
> -
>
> Key: ARROW-18238
> URL: https://issues.apache.org/jira/browse/ARROW-18238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Miles Granger
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We should update the documentation surrounding {{S3FileSystem}} and how to 
> resolve the {{region}}, i.e. {{S3FileSystem.from_uri}}, 
> {{resolve_s3_region(<>)}}
> [R docs | 
> https://arrow.apache.org/docs/r/articles/fs.html#creating-a-filesystem-object]
>  is a good reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17360) [Python] Order of columns in pyarrow.feather.read_table

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-17360.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14528
[https://github.com/apache/arrow/pull/14528]

> [Python] Order of columns in pyarrow.feather.read_table
> ---
>
> Key: ARROW-17360
> URL: https://issues.apache.org/jira/browse/ARROW-17360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 8.0.1
>Reporter: Matthew Roeschke
>Assignee: Alenka Frim
>Priority: Major
>  Labels: orc, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> xref [https://github.com/pandas-dev/pandas/issues/47944]
>  
> {code:java}
> In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
> # pandas main branch / 1.5
> In [2]: df.to_orc("abc")
> In [3]: pd.read_orc("abc", columns=['b', 'a'])
> Out[3]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> In [4]: import pyarrow.orc as orc
> In [5]: orc_file = orc.ORCFile("abc")
> # reordered to a, b
> In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
> Out[6]:
>a  b
> 0  1  a
> 1  2  b
> 2  3  c
> # reordered to a, b
> In [7]: orc_file.read(columns=['b', 'a'])
> Out[7]:
> pyarrow.Table
> a: int64
> b: string
> 
> a: [[1,2,3]]
> b: [["a","b","c"]] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18173) [Python] Drop older versions of Pandas (<1.0)

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18173:
--
Priority: Critical  (was: Major)

> [Python] Drop older versions of Pandas (<1.0)
> -
>
> Key: ARROW-18173
> URL: https://issues.apache.org/jira/browse/ARROW-18173
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alenka Frim
>Priority: Critical
> Fix For: 11.0.0
>
>
> We should drop older versions of pandas and support versions >= 1.0.
> Older versions are frequently causing issues on the CI.
> Version 1.0.0 was released on January 29, 2020.
> The changes will have to be done in:
>  * the official documentation (pandas version support)
>  * the CI jobs supporting older pandas versions
>  * 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/pandas-shim.pxi]
>  * tests that are specifically testing features on older versions of pandas



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18293) [C++] Proxy memory pool crashes with Dataset scanning

2022-11-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-18293:
-

 Summary: [C++] Proxy memory pool crashes with Dataset scanning
 Key: ARROW-18293
 URL: https://issues.apache.org/jira/browse/ARROW-18293
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Discovered while trying to use the proxy memory pool for testing ARROW-18164

See https://github.com/apache/arrow/pull/14516#discussion_r1005433867

This test segfaults (using the fixture in {{test_dataset.py}}):

{code:python}
@pytest.mark.parquet
def test_scanner_proxy_memory_pool(dataset):
    proxy_pool = pa.proxy_memory_pool(pa.default_memory_pool())
    _ = dataset.to_table(memory_pool=proxy_pool)
{code}

Response of [~westonpace]:

{quote}My guess is that the problem is that the scanner erroneously returns 
before all work is completely finished. Changing the thread pool or the memory 
pool too quickly after a scan can lead to this kind of error. The new scanner 
was created specifically to avoid this problem but it isn't the default yet 
(still working through some follow-up PRs to make sure we have the same 
functionality).{quote}

So once that new scanner becomes the default, we can check whether this is fixed.
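
For context, a small sketch of the proxy pool outside of dataset scanning, 
where it behaves as expected (allocation tracking on top of the wrapped pool):

{code:python}
import pyarrow as pa

# The proxy pool forwards allocations to the wrapped pool while tracking them.
proxy_pool = pa.proxy_memory_pool(pa.default_memory_pool())
arr = pa.array([1, 2, 3], memory_pool=proxy_pool)
print(proxy_pool.bytes_allocated())
{code}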



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17832) [Python] Construct MapArray from sequence of dicts (instead of list of tuples)

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-17832.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14547
[https://github.com/apache/arrow/pull/14547]

> [Python] Construct MapArray from sequence of dicts (instead of list of tuples)
> --
>
> Key: ARROW-17832
> URL: https://issues.apache.org/jira/browse/ARROW-17832
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Kshiteej K
>Priority: Major
>  Labels: pull-request-available, python-conversion
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/issues/14116
> Creating a MapArray from a python sequence currently requires lists of tuples 
> as values:
> {code}
> arr = pa.array([[('a', 1), ('b', 2)], [('c', 3)]], pa.map_(pa.string(), 
> pa.int64()))
> {code}
> While I think it makes sense that the following could also work (using dicts 
> instead):
> {code}
> arr = pa.array([{'a': 1, 'b': 2}, {'c': 3}], pa.map_(pa.string(), pa.int64()))
> {code}
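A minimal sketch of both construction forms (the dict form assuming pyarrow >= 11.0.0, where this change landed):

{code:python}
import pyarrow as pa

map_type = pa.map_(pa.string(), pa.int64())

# Existing form: a list of (key, value) tuples per row.
arr_tuples = pa.array([[("a", 1), ("b", 2)], [("c", 3)]], map_type)

# New form (assuming pyarrow >= 11.0.0): a dict per row.
arr_dicts = pa.array([{"a": 1, "b": 2}, {"c": 3}], map_type)

assert arr_tuples.equals(arr_dicts)
{code}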



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18246) [Python][Docs] PyArrow table join docstring typos for left and right suffix arguments

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18246.
---
Resolution: Fixed

Issue resolved by pull request 14591
[https://github.com/apache/arrow/pull/14591]

> [Python][Docs] PyArrow table join docstring typos for left and right suffix 
> arguments
> -
>
> Key: ARROW-18246
> URL: https://issues.apache.org/jira/browse/ARROW-18246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: d33bs
>Assignee: Will Jones
>Priority: Minor
>  Labels: docs-impacting, documentation, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hello, thank you for all the amazing work on Arrow! I'd like to report a 
> potential issue with PyArrow's Table Join docstring which may make it 
> confusing for others to read. I believe this content is also rendered on the 
> documentation website.
> The content which needs to be corrected may be found starting at: 
> [https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L4737]
> The block currently reads:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to right column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffic to add to the left column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> It could be improved with the following:
> {code:java}
> left_suffix : str, default None
> Which suffix to add to left column names. This prevents confusion
> when the columns in left and right tables have colliding names.
> right_suffix : str, default None
> Which suffix to add to the right column names. This prevents confusion
> when the columns in left and right tables have colliding names.{code}
> Please let me know if I may clarify or if there are any questions on the 
> above. Thanks again for your help!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17892) [CI] Use Python 3.10 in AppVeyor build

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-17892.
---
Fix Version/s: 11.0.0
   Resolution: Fixed

Issue resolved by pull request 14307
[https://github.com/apache/arrow/pull/14307]

> [CI] Use Python 3.10 in AppVeyor build
> --
>
> Key: ARROW-17892
> URL: https://issues.apache.org/jira/browse/ARROW-17892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We should change AppVeyor setup 
> [https://github.com/apache/arrow/blob/master/ci/appveyor-cpp-setup.bat]
> to use Python 3.10 and remove {{CONDA_DLL_SEARCH_MODIFICATION_ENABLE}}, as 
> this env var is no longer needed in 3.10 to successfully find the 
> {{arrow_python}} lib located in the {{arrow/python/pyarrow}} folder.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18226) [Python] pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file

2022-11-09 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630860#comment-17630860
 ] 

Joris Van den Bossche commented on ARROW-18226:
---

Could you please show this output from the same script that reads the file, 
or from an interactive Python session? (pip might be picking up a different 
pyarrow than the Python that is used to run the script.)

> [Python] pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file
> -
>
> Key: ARROW-18226
> URL: https://issues.apache.org/jira/browse/ARROW-18226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.0
> Environment: ubuntu
>Reporter: bade tutuş
>Priority: Major
> Attachments: HeLa-S3-training.feather
>
>
> h2. feather.read_dataframe throws the error below.
> Traceback (most recent call last):
>   File "./cv.py", line 86, in 
>     get_training('HeLa-S3'),
>   File "./cv.py", line 19, in get_training
>     
> feather.read_dataframe(f'\{cell_line}-training.feather').set_index(['chr1', 
> 'x1', 'x2', 'chr2', 'y1', 'y2'])
>   File "/home/bade/.local/lib/python3.7/site-packages/pyarrow/feather.py", 
> line 208, in read_feather
>     return (read_table(source, columns=columns, memory_map=memory_map)
>   File "/home/bade/.local/lib/python3.7/site-packages/pyarrow/feather.py", 
> line 230, in read_table
>     reader.open(source, use_memory_map=memory_map)
>   File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherReader.open
>   File "pyarrow/error.pxi", line 123, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-11-09 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630859#comment-17630859
 ] 

Joris Van den Bossche commented on ARROW-18123:
---

Yes, we certainly support relative paths. Targeting this for 11.0 since it is 
actually a regression compared to older pyarrow versions (before we used the 
pyarrow.fs filesystems in write_table, this worked fine).
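A minimal sketch of a possible workaround until the fix lands, assuming {{write_table}} accepts an explicit {{filesystem}} argument (so the path is not interpreted as a URI):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

table = pa.table({"one": [-1.0, 2.5], "two": ["foo", "bar"]})
# Passing the filesystem explicitly sidesteps the URI parsing of the path
# (a sketch of a workaround, assuming write_table supports this keyword).
pq.write_table(table, "例.parquet", filesystem=fs.LocalFileSystem())
{code}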


> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Critical
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-11-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18123:
--
Fix Version/s: 11.0.0

> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Priority: Critical
> Fix For: 11.0.0
>
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18257) [Python] array of time64 type changes from Time64Type to DataType

2022-11-07 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630252#comment-17630252
 ] 

Joris Van den Bossche commented on ARROW-18257:
---

Yes, thanks for the report!

This case seems to be missing here: 
https://github.com/apache/arrow/blob/72d098b6642424696a3cab34b952196336b28a9a/python/pyarrow/public-api.pxi#L74-L121
 
(and taking a further look, Time32Type is missing as well; those two seem to 
be the only types with a custom DataType subclass that are missing)

A fix would certainly be welcome if you would be interested!


> [Python] array of time64 type changes from Time64Type to DataType
> -
>
> Key: ARROW-18257
> URL: https://issues.apache.org/jira/browse/ARROW-18257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
> Environment: python 3.9
> pyarrow 10.0.0
> No pandas installed
>Reporter: &res
>Priority: Minor
>
> When creating an array of time64 elements, the array type information is 
> changed from Time64Type to DataType. 
> While it's not an issue as such, given it still looks like an array of 
> time64, I can't access special attributes of the Time64Type (for example unit)
>  
> {code:java}
> dtype = pa.time64("ns")
> time_array = pa.array(
> [
> 1,
> 2,
> 3
> ],
> dtype
> )
> assert pa.types.is_time64(time_array.type) is True
> assert isinstance(dtype, pa.Time64Type) is True
> assert isinstance(time_array.type, pa.Time64Type) is False # Wrong
> assert isinstance(time_array.type, pa.DataType) is True # Wrong
> assert dtype == time_array.type
> assert dtype.unit == "ns"
> with pytest.raises(AttributeError, match=r"'pyarrow.lib.DataType' object has 
> no attribute 'unit'"):
> # Should be able to access unit:
> time_array.type.unit{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18257) [Python] array of time64 type changes from Time64Type to DataType

2022-11-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18257:
--
Fix Version/s: 11.0.0

> [Python] array of time64 type changes from Time64Type to DataType
> -
>
> Key: ARROW-18257
> URL: https://issues.apache.org/jira/browse/ARROW-18257
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 10.0.0
> Environment: python 3.9
> pyarrow 10.0.0
> No pandas installed
>Reporter: &res
>Priority: Minor
> Fix For: 11.0.0
>
>
> When creating an array of time64 elements, the array type information is 
> changed from Time64Type to DataType. 
> While it's not an issue as such, given it still looks like an array of 
> time64, I can't access special attributes of the Time64Type (for example unit)
>  
> {code:java}
> dtype = pa.time64("ns")
> time_array = pa.array(
> [
> 1,
> 2,
> 3
> ],
> dtype
> )
> assert pa.types.is_time64(time_array.type) is True
> assert isinstance(dtype, pa.Time64Type) is True
> assert isinstance(time_array.type, pa.Time64Type) is False # Wrong
> assert isinstance(time_array.type, pa.DataType) is True # Wrong
> assert dtype == time_array.type
> assert dtype.unit == "ns"
> with pytest.raises(AttributeError, match=r"'pyarrow.lib.DataType' object has 
> no attribute 'unit'"):
> # Should be able to access unit:
> time_array.type.unit{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17820) [C++] Implement arithmetic kernels on List(number)

2022-11-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17820:
--
Summary: [C++] Implement arithmetic kernels on List(number)  (was: 
Implement arithmetic kernels on List(number))

> [C++] Implement arithmetic kernels on List(number)
> --
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17820) Implement arithmetic kernels on List(number)

2022-11-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629186#comment-17629186
 ] 

Joris Van den Bossche commented on ARROW-17820:
---

It would be nice if we had a way for all unary scalar kernels to be applied to 
list arrays (indeed by applying them to the single child array of flat 
values).

I think in SQL one could do this with a subquery with unnesting and aggregating 
again (eg 
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#creating_arrays_from_subqueries,
 although that example is actually not a unary kernel but a binary).

Such an approach doesn't really fit our kernels / Acero, I think. One option 
could be to have a generic kernel to "map" another kernel on the list values. 
Like

{code}
list_map_function(list_array, "kernel_name", FunctionOptions)
{code}

where you can pass the function name you want to apply, and a FunctionOptions 
object matching the kernel. Would something like this be possible technically?

Another option could be to directly register the list type for unary kernels? 
(In many cases there is no ambiguity: we would expect the function to be applied 
to each value in the list, instead of to each list as a whole. For example 
{{round(list)}} or {{ascii_lower(list)}}.)
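As a rough Python-level approximation of what such a mapped kernel would do (assuming a list array without top-level nulls and with a zero offset):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# Apply a unary kernel to the flat child values, then rebuild the list array
# with the original offsets. Only valid for arrays without top-level nulls
# and with a zero offset.
arr = pa.array([[1.15, 2.25], [3.35]], pa.list_(pa.float64()))
flat = arr.flatten()
rounded = pc.round(flat, ndigits=1)
result = pa.ListArray.from_arrays(arr.offsets, rounded)
{code}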




> Implement arithmetic kernels on List(number)
> 
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17820) Implement arithmetic kernels on List(number)

2022-11-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-17820:
--
Labels: kernel query-engine  (was: )

> Implement arithmetic kernels on List(number)
> 
>
> Key: ARROW-17820
> URL: https://issues.apache.org/jira/browse/ARROW-17820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Adam Lippai
>Priority: Major
>  Labels: kernel, query-engine
>
> eg. rounding in list(float64()), similar to a map or foreach



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18251) [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install

2022-11-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629013#comment-17629013
 ] 

Joris Van den Bossche commented on ARROW-18251:
---

No immediate idea. This build has been failing for some time (with a cython 
test failure), but it seems it only recently started to fail with this 
installation issue.

From commits on master: 4 days ago, a test failure: 
https://github.com/apache/arrow/actions/runs/3363608185/jobs/5576970377
3 days ago, an installation failure: 
https://github.com/apache/arrow/actions/runs/3372998317/jobs/5597074003

A relevant difference is a pip 22.2.2 -> 22.3 update.

> [CI][Python] AMD64 macOS 11 Python 3 job fails on master on pip install
> ---
>
> Key: ARROW-18251
> URL: https://issues.apache.org/jira/browse/ARROW-18251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Raúl Cumplido
>Priority: Critical
> Fix For: 11.0.0
>
>
> Currently the job for AMD64 macOS 11 Python 3 is failing:
> [https://github.com/apache/arrow/actions/runs/3388587979/jobs/5630747309]
> with:
> {code:java}
>  + python3 -m pip install --no-deps --no-build-isolation -vv .
> ~/work/arrow/arrow/python ~/work/arrow/arrow
> Using pip 22.3 from 
> /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pip
>  (python 3.11)
> Non-user install because site-packages writeable
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Initialized build tracking at 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Created build tracker: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Entered build tracker: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-install-9ku2dtx5
> Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-ephem-wheel-cache-za6jhm0e
> Processing /Users/runner/work/arrow/arrow/python
>   Added file:///Users/runner/work/arrow/arrow/python to build tracker 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-build-tracker-ib8gr4sw'
>   Created temporary directory: 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16
>   Preparing metadata (pyproject.toml): started
>   Running command Preparing metadata (pyproject.toml)
>   running dist_info
>   creating 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info
>   writing 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/PKG-INFO
>   writing dependency_links to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/dependency_links.txt
>   writing entry points to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/entry_points.txt
>   writing requirements to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/requires.txt
>   writing top-level names to 
> /private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/top_level.txt
>   writing manifest file 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
>   reading manifest template 'MANIFEST.in'
>   warning: no previously-included files matching '*.so' found anywhere in 
> distribution
>   warning: no previously-included files matching '*.pyc' found anywhere in 
> distribution
>   warning: no previously-included files matching '*~' found anywhere in 
> distribution
>   warning: no previously-included files matching '#*' found anywhere in 
> distribution
>   warning: no previously-included files matching '.DS_Store' found anywhere 
> in distribution
>   no previously-included directories found matching '.asv'
>   adding license file '../LICENSE.txt'
>   adding license file '../NOTICE.txt'
>   writing manifest file 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow.egg-info/SOURCES.txt'
>   creating 
> '/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvmgn/T/pip-modern-metadata-yqpsyw16/pyarrow-11.0.0.dev55+g8e3a1e1b7.dist-info'
>   error: invalid command 'bdist_wheel'
>   error: subprocess-exited-with-error
>   
>   × Preparing metadata (pyproject.toml) did not run successfully.
>   │ exit code: 1
>   ╰─> See above for output.
>   
>   note: This error originates from a su

[jira] [Commented] (ARROW-18185) [C++][Compute] Support KEEP_NULL option for compute::Filter

2022-11-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628751#comment-17628751
 ] 

Joris Van den Bossche commented on ARROW-18185:
---

> What about implementing this as a specialized optimization for the if_else 
> AAS case,

That sounds good to me
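For reference, a minimal Python-level illustration of the if_else equivalent being discussed (not the C++ kernel itself):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

values = pa.array([1, 2, 3])
mask = pa.array([True, False, True])

# Where the mask is False, produce null instead of dropping the row --
# the behaviour the proposed FilterOptions::KEEP_NULL would give.
kept = pc.if_else(mask, values, pa.nulls(len(values), values.type))
# kept == [1, null, 3]
{code}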

> [C++][Compute] Support KEEP_NULL option for compute::Filter
> ---
>
> Key: ARROW-18185
> URL: https://issues.apache.org/jira/browse/ARROW-18185
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Jin Shang
>Assignee: Jin Shang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The current Filter implementation always drops the filtered values. In some 
> use cases, it's desirable for the output array to have the same size as the 
> input array. So I added a new option FilterOptions::KEEP_NULL where the 
> filtered values are kept as nulls.
> For example, with input [1, 2, 3] and filter [true, false, true], the current 
> implementation will output [1, 3] and with the new option it will output [1, 
> null, 3]
> This option is simpler to implement since we only need to construct a new 
> validity bitmap and reuse the input buffers and child arrays. Except for 
> dense union arrays which don't have validity bitmaps.
> It is also faster to filter with FilterOptions::KEEP_NULL according to the 
> benchmark result in most cases. So users can choose this option for better 
> performance when dropping filtered values is not required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12739) [C++] Function to combine Arrays row-wise into ListArray

2022-11-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-12739:
--
Labels: kernel query-engine  (was: )

> [C++] Function to combine Arrays row-wise into ListArray
> 
>
> Key: ARROW-12739
> URL: https://issues.apache.org/jira/browse/ARROW-12739
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>  Labels: kernel, query-engine
>
> Add a variadic function that would take 2+ Arrays and combine/transpose them 
> rowwise into a ListArray. For example:
>  Input:
> {code:java}
> Array<String>     Array<String>
> [                 [
>   "foo",            "bar",
>   "push"            "pop"
> ]                 ]
>  {code}
> Output:
> {code:java}
> ListArray<Array<String>>
> [
>   ["foo","bar"],
>   ["push","pop"]
> ]
> {code}
> This is similar to the StructArray constructor which takes a list of Arrays 
> and names (but in this case it would only need to take a list of Arrays).
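A Python-level stopgap (not the requested C++ kernel) that produces the same result:

{code:python}
import pyarrow as pa

a = pa.array(["foo", "push"])
b = pa.array(["bar", "pop"])

# Combine the arrays row-wise in Python; a dedicated kernel would do this
# without the round-trip through Python objects.
combined = pa.array(
    [list(row) for row in zip(a.to_pylist(), b.to_pylist())],
    type=pa.list_(pa.string()),
)
# combined == [["foo", "bar"], ["push", "pop"]]
{code}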



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18229) [C++][Python] RecordBatchReader can be created with a 'dict' schema which then crashes on use

2022-11-03 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628423#comment-17628423
 ] 

Joris Van den Bossche commented on ARROW-18229:
---

I opened a PR to just ensure the argument has to be a schema (I like the idea 
of allowing a dictionary, but that's something we should then also consider in 
other places, starting with creating a schema in {{pa.schema(..)}}, I think).

It's a bit peculiar that we require a Schema for RecordBatchReader.from_batches 
({{PyRecordBatchReader}}), but then don't actually use that schema for anything 
except exposing it through the {{schema}} attribute of the reader: reading works 
fine in the above example, and happily returns batches with a different schema 
than the one you specified.
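For reference, constructing the reader with an actual schema works as expected (a minimal sketch):

{code:python}
import pyarrow as pa

# Passing a real Schema (instead of a dict) avoids the crash.
schema = pa.schema([("a", pa.int8())])
reader = pa.RecordBatchReader.from_batches(schema, [])
print(reader.schema)  # a: int8
{code}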

> [C++][Python] RecordBatchReader can be created with a 'dict' schema which 
> then crashes on use
> -
>
> Key: ARROW-18229
> URL: https://issues.apache.org/jira/browse/ARROW-18229
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: David Li
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: pull-request-available, triaged
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Presumably we should disallow this or convert it to a schema?
> https://github.com/duckdb/duckdb/issues/5143
> {noformat}
> >>> import pyarrow as pa
> >>> pa.__version__
> '10.0.0'
> >>> reader = pa.RecordBatchReader.from_batches({"a": pa.int8()}, [])
> >>> reader.schema
> fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
> (gdb) bt
> #0  0x74247580 in arrow::Schema::num_fields() const ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #1  0x742b93f7 in arrow::(anonymous namespace)::SchemaPrinter::Print()
> ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #2  0x742b98a7 in arrow::PrettyPrint(arrow::Schema const&, 
> arrow::PrettyPrintOptions const&, std::string*) ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #3  0x764f814b in 
> __pyx_pw_7pyarrow_3lib_6Schema_52to_string(_object*, _object*, _object*) ()
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18229) [C++][Python] RecordBatchReader can be created with a 'dict' schema which then crashes on use

2022-11-03 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628416#comment-17628416
 ] 

Joris Van den Bossche commented on ARROW-18229:
---

What is causing the segfault here is actually the _printing_ of the (null) 
schema ({{shared_ptr[CSchema]()}}). Although we should still disallow creating 
a RecordBatchReader with such a schema, of course.

> [C++][Python] RecordBatchReader can be created with a 'dict' schema which 
> then crashes on use
> -
>
> Key: ARROW-18229
> URL: https://issues.apache.org/jira/browse/ARROW-18229
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: David Li
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: triaged
>
> Presumably we should disallow this or convert it to a schema?
> https://github.com/duckdb/duckdb/issues/5143
> {noformat}
> >>> import pyarrow as pa
> >>> pa.__version__
> '10.0.0'
> >>> reader = pa.RecordBatchReader.from_batches({"a": pa.int8()}, [])
> >>> reader.schema
> fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
> (gdb) bt
> #0  0x74247580 in arrow::Schema::num_fields() const ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #1  0x742b93f7 in arrow::(anonymous namespace)::SchemaPrinter::Print()
> ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #2  0x742b98a7 in arrow::PrettyPrint(arrow::Schema const&, 
> arrow::PrettyPrintOptions const&, std::string*) ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #3  0x764f814b in 
> __pyx_pw_7pyarrow_3lib_6Schema_52to_string(_object*, _object*, _object*) ()
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-18229) [C++][Python] RecordBatchReader can be created with a 'dict' schema which then crashes on use

2022-11-03 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-18229:
-

Assignee: Joris Van den Bossche

> [C++][Python] RecordBatchReader can be created with a 'dict' schema which 
> then crashes on use
> -
>
> Key: ARROW-18229
> URL: https://issues.apache.org/jira/browse/ARROW-18229
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 10.0.0
>Reporter: David Li
>Assignee: Joris Van den Bossche
>Priority: Blocker
>  Labels: triaged
>
> Presumably we should disallow this or convert it to a schema?
> https://github.com/duckdb/duckdb/issues/5143
> {noformat}
> >>> import pyarrow as pa
> >>> pa.__version__
> '10.0.0'
> >>> reader = pa.RecordBatchReader.from_batches({"a": pa.int8()}, [])
> >>> reader.schema
> fish: Job 1, 'python3' terminated by signal SIGSEGV (Address boundary error)
> (gdb) bt
> #0  0x74247580 in arrow::Schema::num_fields() const ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #1  0x742b93f7 in arrow::(anonymous namespace)::SchemaPrinter::Print()
> ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #2  0x742b98a7 in arrow::PrettyPrint(arrow::Schema const&, 
> arrow::PrettyPrintOptions const&, std::string*) ()
>from 
> /home/lidavidm/miniconda3/lib/python3.9/site-packages/pyarrow/libarrow.so.1000
> #3  0x764f814b in 
> __pyx_pw_7pyarrow_3lib_6Schema_52to_string(_object*, _object*, _object*) ()
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-18226) [Python] pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file

2022-11-03 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628380#comment-17628380
 ] 

Joris Van den Bossche commented on ARROW-18226:
---

Can you print {{pyarrow.__version__}} in the script where you get this error, 
to ensure you are really using the latest version?
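For instance, at the top of the failing script:

{code:python}
import pyarrow
print(pyarrow.__version__)
{code}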

Another question: how did you install pyarrow?

> [Python] pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file
> -
>
> Key: ARROW-18226
> URL: https://issues.apache.org/jira/browse/ARROW-18226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.0
> Environment: ubuntu
>Reporter: bade tutuş
>Priority: Major
> Attachments: HeLa-S3-training.feather
>
>
> h2. feather.read_dataframe throws the error below.
> Traceback (most recent call last):
>   File "./cv.py", line 86, in 
>     get_training('HeLa-S3'),
>   File "./cv.py", line 19, in get_training
>     
> feather.read_dataframe(f'\{cell_line}-training.feather').set_index(['chr1', 
> 'x1', 'x2', 'chr2', 'y1', 'y2'])
>   File "/home/bade/.local/lib/python3.7/site-packages/pyarrow/feather.py", 
> line 208, in read_feather
>     return (read_table(source, columns=columns, memory_map=memory_map)
>   File "/home/bade/.local/lib/python3.7/site-packages/pyarrow/feather.py", 
> line 230, in read_table
>     reader.open(source, use_memory_map=memory_map)
>   File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherReader.open
>   File "pyarrow/error.pxi", line 123, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

