[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598742#comment-17598742
 ] 

Micah Kornfield commented on ARROW-17459:
-

You would have to follow this up the stack from the previous comments. Without 
seeing the stack trace it is a bit hard to give guidance, but I'd guess there 
are a few places in the linked code that always expected BinaryArray/BinaryBuilder 
and might down_cast; these would need to be adjusted accordingly.
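
As an interim workaround on the Python side (a hedged sketch, not from this 
thread; "data.parquet" is a placeholder path), reading the file one row group at 
a time and concatenating afterwards can sometimes avoid the chunked-array output 
path, as long as no single row group's nested column overflows the 2 GiB binary 
limit:

{code:python}
# Hedged sketch: read row groups individually and concatenate the results.
# This may sidestep "Nested data conversions not implemented for chunked array
# outputs" when no single row group's nested column exceeds the 2 GiB limit.
import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
table = pa.concat_tables(tables)
{code}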

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the C++ library?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to Parquet (I only started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB, with 250K rows and 600 object/string 
columns containing lots of None) with a filter on a non-null column that matches 
a small number of rows (e.g. 1 to 500), the memory usage is pretty high (around 
900MB to 1GB). The resulting table and dataframe have only a few rows (1 row is 
about 20KB, 500 rows about 20MB). It looks like many rows are scanned/loaded 
from the parquet file. Not only is the memory footprint (high-water mark) high, 
but the memory also does not seem to be released in time (such as after a GC in 
Python), though it may get reused for a subsequent read.

When reading the same parquet file without filtering, the memory usage is about 
the same at 900MB. It goes up to 2.3GB after converting to a pandas dataframe 
with to_pandas; df.info(memory_usage='deep') reports 4.3GB, which may be double 
counting something.
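
To separate Arrow's own allocations from pandas' Python-object overhead, a small 
measurement sketch like the following may help (not from the original report; 
"data.parquet" is a placeholder path):

{code:python}
# Hedged sketch: compare Arrow's memory-pool usage with pandas' "deep" memory.
# df.info(memory_usage='deep') also counts Python string objects, which live
# outside the Arrow pool, so the two numbers measure different things.
import pyarrow as pa
import pyarrow.parquet as pq

print("arrow pool before read:", pa.total_allocated_bytes())
table = pq.read_table("data.parquet")
print("arrow pool after read:", pa.total_allocated_bytes())

df = table.to_pandas()
print("arrow pool after to_pandas:", pa.total_allocated_bytes())
print("pandas deep memory:", df.memory_usage(deep=True).sum())
{code}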

The filtered column is not a partition key; filtering on it works functionally 
and returns the correct rows. But the memory usage is quite high even though the 
parquet file is not really large, whether it is partitioned or not. There are 
some earlier reports similar to this issue, for example: 
[https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I also played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Constructing the scanner uses little memory, but usage goes up once the to_table 
call materializes the result.

Is there some way or workaround to reduce the memory usage when reading with 
filters? If it is not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the Parquet columnar format, or whether it can be patched somehow in the 
PyArrow Python code or requires changing and rebuilding the Arrow C++ code.
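
One direction that might lower peak memory (a hedged sketch, not the reporter's 
code; the file path, column names and batch_size are assumptions for 
illustration) is to push the filter into a scanner and consume it in batches 
instead of materializing the whole table at once:

{code:python}
# Hedged sketch: project only the needed columns, apply the filter during the
# scan, and keep batches small so the peak allocation stays bounded.
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")
scanner = dataset.scanner(
    columns=["id", "value"],          # hypothetical column names
    filter=ds.field("id") == 123,     # predicate evaluated while scanning
    batch_size=10_000,                # smaller batches lower peak memory
)
matched = [batch for batch in scanner.to_batches() if batch.num_rows > 0]
table = pa.Table.from_batches(matched, schema=scanner.projected_schema)
{code}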

Thanks!

  was:
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file without filtering, the memory usage is about 
the same at 900MB, and goes up to 2.3GB after converting to a pandas dataframe 
with to_pandas; df.info(memory_usage='deep') reports 4.3GB, which may be double 
counting something.

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with filter on a not null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB 
> to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). Looks like it scans/loads many rows from the parquet file. Not 
> only the footprint or watermark of memory usage is high, but also it seems 
> not releasing the memory in time (such as after GC in Python, but may get 
> used for subsequent read).
> When reading the same parquet file without filte

[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file without filtering, the memory usage is about 
the same at 900MB, and goes up to 2.3GB after converting to a pandas dataframe 
with to_pandas; df.info(memory_usage='deep') reports 4.3GB, which may be double 
counting something.

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with filter on a not null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB 
> to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). Looks like it scans/loads many rows from the parquet file. Not 
> only the footprint or watermark of memory usage is high, but also it seems 
> not releasing the memory in time (such as after GC in Python, but may get 
> used for subsequent read).
> When reading the same parquet file without filtering, the memory usage is 
> about the same at 900MB, and goes up to 2.3GB after converting to a pandas 
> dataframe with to_pandas; df.info(memory_usage='deep') reports 4.3GB, which 
> may be double counting something.
> The filtered column is not a partition key, which functionally works 

[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (close to 1GB). The 
result table and dataframe have only a few rows (1 row 20kb, 500 rows 20MB). 
Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with filter on a not null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB 
> to 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). Looks like it scans/loads many rows from the parquet file. Not 
> only the footprint or watermark of memory usage is high, but also it seems 
> not releasing the memory in time (such as after GC in Python, but may get 
> used for subsequent read).
> The filtered column is not a partition key, which functionally works to get a 
> small number of rows. But the memory usage is high when the parquet 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played wit

[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (close to 1GB). The 
result table and dataframe have only a few rows (1 row 20kb, 500 rows 20MB). 
Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter for a small number of rows (e.g. 1 to 
500), the memory usage is pretty high (close to 1GB). The result table and 
dataframe have only a few rows (at about 20MB). Looks like it scans/loads many 
rows from the parquet file. Not only the footprint or watermark of memory usage 
is high, but also it seems not releasing the memory in time (such as after GC 
in Python, but may get used for subsequent read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with filter on a not null column for a small 
> number of rows (e.g. 1 to 500), the memory usage is pretty high (close to 
> 1GB). The result table and dataframe have only a few rows (1 row 20kb, 500 
> rows 20MB). Looks like it scans/loads many rows from the parquet file. Not 
> only the footprint or watermark of memory usage is high, but also it seems 
> not releasing the memory in time (such as after GC in Python, but may get 
> used for subsequent read).
> The filtered column is not a partition key, which functionally works to get a 
> small number of rows. But the memory usage is high when the parquet 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._

[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter for a small number of rows (e.g. 1 to 
500), the memory usage is pretty high (close to 1GB). The result table and 
dataframe have only a few rows (at about 20MB). Looks like it scans/loads many 
rows from the parquet file. Not only the footprint or watermark of memory usage 
is high, but also it seems not releasing the memory in time (such as after GC 
in Python, but may get used for subsequent read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!

  was:
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a parquet file (about 23MB with 250K rows and 600 object/string 
> columns with lots of None) with filter for a small number of rows (e.g. 1 to 
> 500), the memory usage is pretty high (close to 1GB). The result table and 
> dataframe have only a few rows (at about 20MB). Looks like it scans/loads 
> many rows from the parquet file. Not only the footprint or watermark of 
> memory usage is high, but also it seems not releasing the memory in time 
> (such as after GC in Python, but may get used for subsequent read).
> The filtered column is not a partition key, which functionally works to get a 
> small number of rows. But the memory usage is high when the parquet 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small to construct the scanner but then goes up after the 
> to_table call materializes

[jira] [Closed] (ARROW-17591) Arrow is NOT working with Java 17

2022-08-31 Thread evan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

evan closed ARROW-17591.

Resolution: Not A Problem

> Arrow is NOT working with Java 17
> -
>
> Key: ARROW-17591
> URL: https://issues.apache.org/jira/browse/ARROW-17591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: evan
>Priority: Major
>
> h2. Running environment:
>  * OS: mac monterey
>  * Language: Java
>  * Java version: 17.0.2
>  * Arrow version: 9.0.0
> h2. Issue:
>  
> {code:java}
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.arrow.memory.util.MemoryUtil
>         at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
> {code}
> h2. Things I have done
>  * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, and if it can be patched somehow in the Pyarrow 
Python code, or need to change and build the arrow C++ code.

Thanks!

  was:
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't see if it can be patched in the Python 
code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a large parquet file with filter for a small number of rows, the 
> memory usage is pretty high. The result table and dataframe have only a few 
> rows. Looks like it scans/loads many rows from the parquet file. Not only the 
> footprint or watermark of memory usage is high, but also it seems not 
> releasing the memory in time (such as after GC in Python, but may get used 
> for subsequent read).
> The filtered column is not a partition key, which functionally works to get a 
> small number of rows. But the memory usage is high when the parquet 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small to construct the scanner but then goes up after the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with filters? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us. I don't know what may be involved with 
> respect to the parquet columnar format, and if it can be patched somehow in 
> the Pyarrow Python code, or need to change and build the arrow C++ code.

[jira] [Updated] (ARROW-17591) Arrow is NOT working with Java 17

2022-08-31 Thread evan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

evan updated ARROW-17591:
-
Description: 
h2. Running environment:
 * OS: mac monterey
 * Language: Java
 * Java version: 17.0.2
 * Arrow version: 9.0.0

h2. Issue:

 
{code:java}
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.arrow.memory.util.MemoryUtil
        at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
{code}
h2. Things I have done
 * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args

  was:
h2. Running environment:
 * OS: mac monterey
 * Language: Java
 * Java version: 17.0.2
 * Arrow version: 9.0.0

h2. Issue:

 
{code:java}
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.arrow.memory.util.MemoryUtil
        at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53)
 {code}
 
h2. Things I have done
 * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args


> Arrow is NOT working with Java 17
> -
>
> Key: ARROW-17591
> URL: https://issues.apache.org/jira/browse/ARROW-17591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: evan
>Priority: Major
>
> h2. Running environment:
>  * OS: mac monterey
>  * Language: Java
>  * Java version: 17.0.2
>  * Arrow version: 9.0.0
> h2. Issue:
>  
> {code:java}
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.arrow.memory.util.MemoryUtil
>         at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
> {code}
> h2. Things I have done
>  * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17591) Arrow is NOT working with Java 17

2022-08-31 Thread evan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

evan updated ARROW-17591:
-
Description: 
h2. Running environment:
 * OS: mac monterey
 * Language: Java
 * Java version: 17.0.2
 * Arrow version: 9.0.0

h2. Issue:

 
{code:java}
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.arrow.memory.util.MemoryUtil
        at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53)
 {code}
 
h2. Things I have done
 * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args

  was:
## Running environment:
 * OS: mac monterey
 * Language: Java
 * Java version: 17.0.2
 * Arrow version: 9.0.0

## Issue:

 
{code:java}
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.arrow.memory.util.MemoryUtil
        at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53)
 {code}
 

 

## Things I have done to try to fix it:

1. adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args


> Arrow is NOT working with Java 17
> -
>
> Key: ARROW-17591
> URL: https://issues.apache.org/jira/browse/ARROW-17591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 9.0.0
>Reporter: evan
>Priority: Major
>
> h2. Running environment:
>  * OS: mac monterey
>  * Language: Java
>  * Java version: 17.0.2
>  * Arrow version: 9.0.0
> h2. Issue:
>  
> {code:java}
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.arrow.memory.util.MemoryUtil
>         at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
>         at 
> org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
>         at 
> com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65)
>         at 
> com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53)
>  {code}
>  
> h2. Things I have done
>  * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17591) Arrow is NOT working with Java 17

2022-08-31 Thread evan (Jira)
evan created ARROW-17591:


 Summary: Arrow is NOT working with Java 17
 Key: ARROW-17591
 URL: https://issues.apache.org/jira/browse/ARROW-17591
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 9.0.0
Reporter: evan


## Running environment:
 * OS: mac monterey
 * Language: Java
 * Java version: 17.0.2
 * Arrow version: 9.0.0

## Issue:

 
{code:java}
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.arrow.memory.util.MemoryUtil
        at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
        at 
org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65)
        at 
com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53)
 {code}
 

 

## Things I have done to try to fix it:

1. adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't see if it can be patched in the Python 
code.

Thanks!

  was:
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: 
[https://github.com/apache/arrow/issues/7338|https://github.com/apache/arrow/issues/7338.]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't see if it can be patched in the Python 
code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a large parquet file with filter for a small number of rows, the 
> memory usage is pretty high. The result table and dataframe have only a few 
> rows. Looks like it scans/loads many rows from the parquet file. Not only the 
> footprint or watermark of memory usage is high, but also it seems not 
> releasing the memory in time (such as after GC in Python, but may get used 
> for subsequent read).
> The filtered column is not a partition key, which functionally works to get a 
> small number of rows. But the memory usage is high when the parquet 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small to construct the scanner but then goes up after the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with filters? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us. I don't see if it can be patched in the 
> Python code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: 
[https://github.com/apache/arrow/issues/7338|https://github.com/apache/arrow/issues/7338.]

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't see if it can be patched in the Python 
code.

Thanks!

  was:
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338.] 

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't see if it can be patched in the Python 
code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a large parquet file with filter for a small number of rows, the 
> memory usage is pretty high. The result table and dataframe have only a few 
> rows. Looks like it scans/loads many rows from the parquet file. Not only the 
> footprint or watermark of memory usage is high, but also it seems not 
> releasing the memory in time (such as after GC in Python, but may get used 
> for subsequent read).
> The filtered column is not a partition key, which functionally works to get a 
> small number of rows. But the memory usage is high when the parquet 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: 
> [https://github.com/apache/arrow/issues/7338|https://github.com/apache/arrow/issues/7338.]
> Related classes/methods in (pyarrow 9.0.0) 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> The memory usage is small to construct the scanner but then goes up after the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with filters? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us. I don't see if it can be patched in the 
> Python code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)
Yin created ARROW-17590:
---

 Summary: Lower memory usage with filters
 Key: ARROW-17590
 URL: https://issues.apache.org/jira/browse/ARROW-17590
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yin


Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338.] 

Related classes/methods in (pyarrow 9.0.0) 

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
The memory usage is small to construct the scanner but then goes up after the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't see if it can be patched in the Python 
code.

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags

2022-08-31 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598685#comment-17598685
 ] 

Kouhei Sutou commented on ARROW-17580:
--

{quote}
For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the 
{{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).
{quote}

Really? The {{CXXFLAGS}} environment variable should be respected, because we 
don't override {{CMAKE_CXX_FLAGS}}, which is initialized from the {{CXXFLAGS}} 
environment variable.
See also:
* https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L555
* https://cmake.org/cmake/help/latest/envvar/CXXFLAGS.html

For PyArrow:
* We should unify {{python/CMakeLists.txt}} and 
{{python/pyarrow/src/CMakeLists.txt}}.
* We need to look into why the {{CXXFLAGS}} environment variable is ignored.

> [Doc][C++][Python] Unclear how to influence compilation flags
> -
>
> Key: ARROW-17580
> URL: https://issues.apache.org/jira/browse/ARROW-17580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Antoine Pitrou
>Priority: Critical
>
> Frequently people need to customize compilation flags for C++ and/or C files.
> Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find 
> out the proper way to do this.
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while 
> the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).
> For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment 
> variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two 
> problems:
> * it is only recognized for Cython-generated files, not for PyArrow C++ 
> sources
> * it only affects linker calls, while it should actually affect compiler calls



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17578) [CI] Nightly test-r-gcc-12 fails to build

2022-08-31 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598676#comment-17598676
 ] 

Kouhei Sutou commented on ARROW-17578:
--

We need to focus on the following message:

https://github.com/ursacomputing/crossbow/runs/8104457062?check_suite_focus=true#step:5:1897

{noformat}
#4 9.317 The following packages have unmet dependencies:
#4 9.411  python3-dev : Depends: python3-distutils (>= 3.10.6-1~) but 
3.10.4-0ubuntu1 is to be installed
#4 9.415 E: Unable to correct problems, you have held broken packages.
{noformat}

This is caused by {{ppa:ubuntu-toolchain-r/volatile}} 
(https://github.com/apache/arrow/blob/master/ci/docker/ubuntu-22.04-cpp.dockerfile#L118). 
It provides {{python3-dev=3.10.6-1~22.04}} but doesn't provide 
{{python3-distutils=3.10.6-1~22.04}}.

It seems that Ubuntu 22.04 provides {{gcc-12}}: 
https://packages.ubuntu.com/search?keywords=gcc-12

How about using the system {{gcc-12}} instead of {{gcc-12}} provided by 
{{ppa:ubuntu-toolchain-r/volatile}}?

{noformat}
diff --git a/ci/docker/ubuntu-22.04-cpp.dockerfile 
b/ci/docker/ubuntu-22.04-cpp.dockerfile
index 05aca53151..514e314c40 100644
--- a/ci/docker/ubuntu-22.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-22.04-cpp.dockerfile
@@ -112,7 +112,7 @@ RUN if [ "${gcc_version}" = "" ]; then \
   g++ \
   gcc; \
 else \
-  if [ "${gcc_version}" -gt "11" ]; then \
+  if [ "${gcc_version}" -gt "12" ]; then \
   apt-get update -y -q && \
   apt-get install -y -q --no-install-recommends 
software-properties-common && \
   add-apt-repository ppa:ubuntu-toolchain-r/volatile; \
{noformat}

> [CI] Nightly test-r-gcc-12 fails to build
> -
>
> Key: ARROW-17578
> URL: https://issues.apache.org/jira/browse/ARROW-17578
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Raúl Cumplido
>Priority: Major
>  Labels: Nightly
>
> [test-r-gcc-12|https://github.com/ursacomputing/crossbow/runs/8104457062?check_suite_focus=true]
>  has been failing to build since the 18th of August. The current error log is:
> {code:java}
>  #4 ERROR: executor failed running [/bin/bash -o pipefail -c apt-get update 
> -y &&     apt-get install -y         dirmngr         apt-transport-https      
>    software-properties-common &&     wget -qO- 
> https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc |         
> tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc &&     add-apt-repository 
> 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release 
> -cs)'-cran40/' &&     apt-get install -y         r-base=${r}*         
> r-recommended=${r}*         libxml2-dev         libgit2-dev         
> libssl-dev         clang         clang-format         clang-tidy         
> texlive-latex-base         locales         python3         python3-pip        
>  python3-dev &&     locale-gen en_US.UTF-8 &&     apt-get clean &&     rm -rf 
> /var/lib/apt/lists/*]: exit code: 100
> --
>  > [ 2/17] RUN apt-get update -y &&     apt-get install -y         dirmngr    
>      apt-transport-https         software-properties-common &&     wget -qO- 
> https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc |         
> tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc &&     add-apt-repository 
> 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release 
> -cs)'-cran40/' &&     apt-get install -y         r-base=4.2*         
> r-recommended=4.2*         libxml2-dev         libgit2-dev         libssl-dev 
>         clang         clang-format         clang-tidy         
> texlive-latex-base         locales         python3         python3-pip        
>  python3-dev &&     locale-gen en_US.UTF-8 &&     apt-get clean &&     rm -rf 
> /var/lib/apt/lists/*:
> --
> executor failed running [/bin/bash -o pipefail -c apt-get update -y &&     
> apt-get install -y         dirmngr         apt-transport-https         
> software-properties-common &&     wget -qO- 
> https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc |         
> tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc &&     add-apt-repository 
> 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release 
> -cs)'-cran40/' &&     apt-get install -y         r-base=${r}*         
> r-recommended=${r}*         libxml2-dev         libgit2-dev         
> libssl-dev         clang         clang-format         clang-tidy         
> texlive-latex-base         locales         python3         python3-pip        
>  python3-dev &&     locale-gen en_US.UTF-8 &&     apt-get clean &&     rm -rf 
> /var/lib/apt/lists/*]: exit code: 100
> Service 'ubuntu-r-only-r' failed to build : Build failed {code}
> I can't reproduce locally (I get a different error) but could it be that w

[jira] [Resolved] (ARROW-17079) [C++] Improve error message propagation from AWS SDK

2022-08-31 Thread Philipp Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz resolved ARROW-17079.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14001
[https://github.com/apache/arrow/pull/14001]

> [C++] Improve error message propagation from AWS SDK
> 
>
> Key: ARROW-17079
> URL: https://issues.apache.org/jira/browse/ARROW-17079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 8.0.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Dear all,
> I'd like to see if there is interest to improve the error messages that 
> originate from the AWS SDK. Especially for loading datasets from S3, there 
> are many things that can go wrong and the error messages that (Py)Arrow gives 
> are not always the most actionable, especially if the call involves many 
> different SDK functions. In particular, it would be great to have the 
> following attached to each error message:
>  * A machine parseable status code from the AWS SDK
>  * Information as to exactly which AWS SDK call failed, so it can be 
> disambiguated for Arrow API calls that use multiple AWS SDK calls
> In the ideal case, as a developer I could reconstruct the AWS SDK call that 
> failed from the error message (e.g. in a form that allows me to run the API 
> call via the "aws" CLI program) so I can debug errors and see how they relate 
> to my AWS infrastructure. Any progress in this direction would be super 
> helpful.
>  
> For context: I recently was debugging some permissioning issues in S3 based 
> on the current error codes and it was pretty hard to figure out what was 
> going on (see 
> [https://github.com/ray-project/ray/issues/19799#issuecomment-1185035602).]
>  
> I'm happy to take a stab at this problem but might need some help. Is 
> implementing a custom StatusDetail class for AWS errors and propagating 
> errors that way the right hunch here? 
> [https://github.com/apache/arrow/blob/50f6fcad6cc09c06e78dcd09ad07218b86e689de/cpp/src/arrow/status.h#L110]
>  
> All the best,
> Philipp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14161:
---
Labels: pull-request-available  (was: )

> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Missing documentation on the Reading/Writing Parquet files C++ API:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  missing docs on chunk_size found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the 
> include should be {{#include "parquet/arrow/reader.h"}}
>  * 
> [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
>  missing docs on {{compression}}
>  * Missing example on using WriteProperties
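
For orientation only (the ticket itself is about the C++ docs), the Python bindings 
expose the same two knobs named in the first and third bullets above; a small sketch 
with a made-up output path:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100_000))})

# row_group_size corresponds to the C++ WriteTable chunk_size argument
# (rows per RowGroup in the output file); compression corresponds to
# WriterProperties::Builder::compression.
pq.write_table(table, "example.parquet",
               row_group_size=64 * 1024,
               compression="snappy")
{code}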



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17589) [Docs] Clarify that C data interface is not strictly for within a single process

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17589:
---
Labels: pull-request-available  (was: )

> [Docs] Clarify that C data interface is not strictly for within a single 
> process
> 
>
> Key: ARROW-17589
> URL: https://issues.apache.org/jira/browse/ARROW-17589
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 9.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow C data interface and C stream interface docs say that these 
> interfaces are only for sharing data within a single process. This is not 
> exactly correct. The fundamental requirement for using the C data interface 
> and C stream interface is not that the components be necessarily running in 
> the same process. It’s that they are able to pass around a pointer and access 
> the same memory address space. Let's make some minor changes to the docs to 
> clarify this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17589) [Docs] Clarify that C data interface is not strictly for within a single process

2022-08-31 Thread Ian Cook (Jira)
Ian Cook created ARROW-17589:


 Summary: [Docs] Clarify that C data interface is not strictly for 
within a single process
 Key: ARROW-17589
 URL: https://issues.apache.org/jira/browse/ARROW-17589
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 9.0.0
Reporter: Ian Cook
Assignee: Ian Cook
 Fix For: 10.0.0


The Arrow C data interface and C stream interface docs say that these 
interfaces are only for sharing data within a single process. This is not 
exactly correct. The fundamental requirement for using the C data interface and 
C stream interface is not that the components be necessarily running in the 
same process. It’s that they are able to pass around a pointer and access the 
same memory address space. Let's make some minor changes to the docs to clarify 
this.
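
For what it's worth, a minimal sketch of what "passing around a pointer" looks like in 
practice, using pyarrow's bundled CFFI definitions (requires the cffi package; producer 
and consumer here are the same Python process, but the same mechanics apply to any two 
components that can dereference the same addresses):

{code:python}
import pyarrow as pa
from pyarrow.cffi import ffi  # ArrowSchema / ArrowArray struct cdefs shipped with pyarrow

# Producer: export an array through the C data interface.
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
array_ptr = int(ffi.cast("uintptr_t", c_array))
pa.array([1, 2, 3])._export_to_c(array_ptr, schema_ptr)

# Consumer: only the two addresses are exchanged; the consumer just has to be
# able to dereference them, i.e. share the producer's memory address space.
imported = pa.Array._import_from_c(array_ptr, schema_ptr)
print(imported)
{code}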



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17577) [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow

2022-08-31 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598629#comment-17598629
 ] 

Kouhei Sutou commented on ARROW-17577:
--

Sorry for the inconvenience...
I think you have the {{PKG_CONFIG_PATH=$ARROW_HOME/lib/pkgconfig}} environment 
variable set. You need to set the {{CMAKE_PREFIX_PATH=$ARROW_HOME}} environment 
variable instead so that {{Arrow}} and the other CMake packages can be found.

I'll update the related documents in ARROW-17575.

> [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow
> 
>
> Key: ARROW-17577
> URL: https://issues.apache.org/jira/browse/ARROW-17577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Major
>
> When building on master yesterday, the PyArrow build worked fine. Today there 
> is an issue with CMake being unable to find packages. See:
>  
> {code:java}
> -- Finished CMake build and install for PyArrow C++
> creating /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9
> -- Running cmake for PyArrow
> cmake -DPYTHON_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python 
> -DPython3_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python 
> -DPYARROW_CPP_HOME=/Users/alenkafrim/repos/arrow/python/build/dist "" 
> -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_SUBSTRAIT=off 
> -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on 
> -DPYARROW_BUILD_PARQUET_ENCRYPTION=off -DPYARROW_BUILD_PLASMA=off 
> -DPYARROW_BUILD_GCS=off -DPYARROW_BUILD_S3=on -DPYARROW_BUILD_HDFS=off 
> -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
> -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
> -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
> -DCMAKE_BUILD_TYPE=release /Users/alenkafrim/repos/arrow/python
> CMake Warning:
>   Ignoring empty string ("") provided on the command line.
> -- The C compiler identification is AppleClang 13.1.6.13160021
> -- The CXX compiler identification is AppleClang 13.1.6.13160021
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Check for working C compiler: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
>  - skipped
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- System processor: arm64
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
> -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Success
> -- Arrow build warning level: PRODUCTION
> -- Configured for RELEASE build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: RELEASE
> -- Generator: Unix Makefiles
> -- Build output directory: 
> /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9/release
> -- Found Python3: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python (found 
> version "3.9.13") found components: Interpreter Development.Module NumPy 
> -- Found Python3Alt: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python  
> CMake Error at 
> /opt/homebrew/Cellar/cmake/3.24.1/share/cmake/Modules/CMakeFindDependencyMacro.cmake:47
>  (find_package):
>   By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
>   asked CMake to find a package configuration file provided by "Arrow", but
>   CMake did not find one.
>   Could not find a package configuration file provided by "Arrow" with any of
>   the following names:
>     ArrowConfig.cmake
>     arrow-config.cmake
>   Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
>   "Arrow_DIR" to a directory containing one of the above files.  If "Arrow"
>   provides a separate development package or SDK, be sure it has been
>   installed.
> Call Stack (most recent call first):
>   build/dist/lib/cmake/ArrowPython/ArrowPythonConfig.cmake:54 
> (find_dependency)
>   CMakeLists.txt:240 (find_package)
> {code}
> I did a clean build on the latest master. Am I missing some variables that 
> need to be set after [https://github.com/apache/arrow/pull/13892] ?
> I am calling cmake with these flags:
>  
> {code:java}
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>               -DCMAKE_INSTALL_LIBDIR=lib \
>               -DCMAKE_BUILD_TYPE=debug \
>               -DARROW_WITH_BZ2=ON \
>               -DARROW_WITH_ZLIB=ON \
>               -DARROW_WITH_ZSTD=ON \
>               -DARROW_WITH_LZ4=ON \
>               -DARROW_WITH_SNAPPY=ON \
>               -DARROW_WITH_B

[jira] [Updated] (ARROW-17575) [C++][Docs] Update build document to follow new CMake package

2022-08-31 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17575:
-
Description: 
This is a follow-up of ARROW-12175.

https://github.com/apache/arrow/blob/master/docs/source/cpp/build_system.rst 
should be updated.

  was:This is a follow-up of ARROW-12175.


> [C++][Docs] Update build document to follow new CMake package
> -
>
> Key: ARROW-17575
> URL: https://issues.apache.org/jira/browse/ARROW-17575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>
> This is a follow-up of ARROW-12175.
> https://github.com/apache/arrow/blob/master/docs/source/cpp/build_system.rst 
> should be updated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598581#comment-17598581
 ] 

Arthur Passos commented on ARROW-17459:
---

I am a bit lost right now. I have made some changes to use LargeBinaryBuilder, but 
there is always an inconsistency that throws an exception. Are you aware of any 
place in the code where, instead of taking the String path, it would take the 
LargeString path? I went all the way back to where the schema is read, in the 
hope of finding a place where I could change the DataType from STRING to 
LARGE_STRING, but couldn't do so.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17588) [Go] Casting to BinaryLike types

2022-08-31 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17588:
-

 Summary: [Go] Casting to BinaryLike types
 Key: ARROW-17588
 URL: https://issues.apache.org/jira/browse/ARROW-17588
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17587) [Go] Cast from Extension Types

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17587:
---
Labels: pull-request-available  (was: )

> [Go] Cast from Extension Types
> --
>
> Key: ARROW-17587
> URL: https://issues.apache.org/jira/browse/ARROW-17587
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17587) [Go] Cast from Extension Types

2022-08-31 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17587:
-

 Summary: [Go] Cast from Extension Types
 Key: ARROW-17587
 URL: https://issues.apache.org/jira/browse/ARROW-17587
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17584) [Go] Unable to build with tinygo

2022-08-31 Thread Matthew Topol (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598533#comment-17598533
 ] 

Matthew Topol commented on ARROW-17584:
---

[~tschaub] So far we've been keeping the N-2 concept for support. Since 
`unsafe.Slice` was introduced in go1.17, it would make sense for us to bump our 
required version to go1.17 and convert our references to the slice header to 
use `unsafe.Slice` instead. Can you check/confirm whether `unsafe.Slice` works 
correctly with TinyGo, such that it would be a viable alternative?

> [Go] Unable to build with tinygo
> 
>
> Key: ARROW-17584
> URL: https://issues.apache.org/jira/browse/ARROW-17584
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Tim Schaub
>Priority: Major
>
> I was hoping to use TinyGo to build WASM binaries with Arrow.  TinyGo can 
> generate builds that are [1% the 
> size|https://tinygo.org/getting-started/overview/#:~:text=The%20only%20difference%20here%2C%20is,used%2C%20and%20the%20associated%20runtime.&text=In%20this%20case%20the%20Go,size%20(251k%20before%20stripping)!]
>  of those generated with Go (significant for applications hosted on the web).
> Arrow's use of `reflect.SliceHeader` fields limits the portability of the 
> code.  For example, the `Len` and `Cap` fields are assumed to be `int` here: 
> https://github.com/apache/arrow/blob/go/v9.0.0/go/arrow/bitutil/bitutil.go#L158-L159
> Go's [reflect package 
> warns|https://github.com/golang/go/blob/go1.19/src/reflect/value.go#L2675-L2685]
>  that the SliceHeader "cannot be used safely or portably and its 
> representation may change in a later release."
> Attempts to build a WASM binary using the github.com/apache/arrow/go/v10 
> module result in failures like this:
> {code}
> tinygo build -tags noasm -o test.wasm ./main.go
> {code}
> {code} 
> # github.com/apache/arrow/go/v10/arrow/bitutil
> ../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:158:10:
>  invalid operation: h.Len / uint64SizeBytes (mismatched types uintptr and int)
> ../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:159:10:
>  invalid operation: h.Cap / uint64SizeBytes (mismatched types uintptr and int)
> {code}
> This happens because TinyGo uses `uintptr` for the corresponding types: 
> https://github.com/tinygo-org/tinygo/blob/v0.25.0/src/reflect/value.go#L773-L777
> This feels like an issue with TinyGo, and it has been ticketed there multiple 
> times (see https://github.com/tinygo-org/tinygo/issues/1284).  They lean on 
> the warnings in the Go sources that use of the SliceHeader fields makes code 
> unportable and suggest changes to the libraries that do not heed this warning.
> I don't have a suggested fix or alternative for Arrow's use of SliceHeader 
> fields, but I'm wondering if there would be willingness on the part of this 
> package to make WASM builds work with TinyGo.  Perhaps the TinyGo authors 
> could also offer suggested changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17586) [Go] String to Numeric Cast functions

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17586:
---
Labels: pull-request-available  (was: )

> [Go] String to Numeric Cast functions
> -
>
> Key: ARROW-17586
> URL: https://issues.apache.org/jira/browse/ARROW-17586
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Go
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak

2022-08-31 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-17573.
---
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 14013
[https://github.com/apache/arrow/pull/14013]

> [Go] Parquet ByteArray statistics cause memory leak
> ---
>
> Key: ARROW-17573
> URL: https://issues.apache.org/jira/browse/ARROW-17573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Parquet
>Affects Versions: 9.0.0
>Reporter: Sasha Sirovica
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to parquet, the memory of the program 
> continuously increases. This also applies to the other `arrow.BinaryTypes`.
>  
> I took a heap dump midway through the program, and the majority of allocations 
> come from `StringBuilder.Append`, which is not GC'd. I approached 16GB of RAM 
> before terminating the program.
>  
> I was not able to replicate this behavior with just PrimitiveTypes. Another 
> interesting point: if the records are created but never written with pqarrow, 
> memory does not grow. In the program below, commenting out `w.Write(rec)` will 
> not cause memory issues.
> Example program which causes memory to leak:
> {code:java}
> package main
> import (
>"os"
>"github.com/apache/arrow/go/v9/arrow"
>"github.com/apache/arrow/go/v9/arrow/array"
>"github.com/apache/arrow/go/v9/arrow/memory"
>"github.com/apache/arrow/go/v9/parquet"
>"github.com/apache/arrow/go/v9/parquet/compress"
>"github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> func main() {
>f, _ := os.Create("/tmp/test.parquet")
>arrowProps := pqarrow.DefaultWriterProps()
>schema := arrow.NewSchema(
>   []arrow.Field{
>  {Name: "aString", Type: arrow.BinaryTypes.String},
>   },
>   nil,
>)
>w, _ := pqarrow.NewFileWriter(schema, f, 
> parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
> arrowProps)
>builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>for i := 1; i < 50_000_000; i++ {
>   builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
>   if i%2_000_000 == 0 {
>  // Write row groups out every 2M times
>  rec := builder.NewRecord()
>  w.Write(rec)
>  rec.Release()
>   }
>}
>w.Close()
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17586) [Go] String to Numeric Cast functions

2022-08-31 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-17586:
-

 Summary: [Go] String to Numeric Cast functions
 Key: ARROW-17586
 URL: https://issues.apache.org/jira/browse/ARROW-17586
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Matthew Topol
Assignee: Matthew Topol






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files

2022-08-31 Thread Will Jones (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Jones reassigned ARROW-14161:
--

Assignee: Will Jones

> [C++][Parquet][Docs] Reading/Writing Parquet Files
> --
>
> Key: ARROW-14161
> URL: https://issues.apache.org/jira/browse/ARROW-14161
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Rares Vernica
>Assignee: Will Jones
>Priority: Minor
> Fix For: 10.0.0
>
>
> Missing documentation on the Reading/Writing Parquet files C++ API:
>  * 
> [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE]
>  missing docs on chunk_size found some 
> [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53]
>  _size of the RowGroup in the parquet file. Normally you would choose this to 
> be rather large_
>  * Typo in file reader 
> [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the 
> include should be {{#include "parquet/arrow/reader.h"}}
>  * 
> [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE]
>  missing docs on {{compression}}
>  * Missing example on using WriteProperties



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17585) [Java] Extend types supported by GenerateSampleData

2022-08-31 Thread Larry White (Jira)
Larry White created ARROW-17585:
---

 Summary: [Java] Extend types supported by GenerateSampleData
 Key: ARROW-17585
 URL: https://issues.apache.org/jira/browse/ARROW-17585
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Larry White


org.apache.arrow.vector.GenerateSampleData does not support the UInt vector 
types.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14958) [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight

2022-08-31 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598502#comment-17598502
 ] 

Todd Farmer commented on ARROW-14958:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight
> ---
>
> Key: ARROW-14958
> URL: https://issues.apache.org/jira/browse/ARROW-14958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Sans Python support, at least for now, since figuring out how to do the 
> bindings will be a challenge there. Also see 
> [https://github.com/open-telemetry/community/discussions/734]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-14958) [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight

2022-08-31 Thread Todd Farmer (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Farmer reassigned ARROW-14958:
---

Assignee: (was: David Li)

> [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight
> ---
>
> Key: ARROW-14958
> URL: https://issues.apache.org/jira/browse/ARROW-14958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Sans Python support, at least for now, since figuring out how to do the 
> bindings will be a challenge there. Also see 
> [https://github.com/open-telemetry/community/discussions/734]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598501#comment-17598501
 ] 

Micah Kornfield commented on ARROW-17459:
-

Yes, I think there are some code changes needed: we hard-code a non-large 
[BinaryBuilder|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1693]
 for accumulating chunks, which is then used when [decoding 
arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1392].

To answer your questions, I don't think the second case applies. As far as I 
know, Parquet C++ does its own chunking and doesn't try to read back the exact 
chunking that the values were written with.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-6485) [Format][C++]Support the format of a COO sparse matrix that has separated row and column indices

2022-08-31 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-6485:
-

Assignee: Rok Mihevc

> [Format][C++]Support the format of a COO sparse matrix that has separated row 
> and column indices
> 
>
> Key: ARROW-6485
> URL: https://issues.apache.org/jira/browse/ARROW-6485
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Kenta Murata
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> To support zero-copy interchange of scipy.sparse.coo_matrix, I'd like to 
> add a new format for a COO matrix that has separated row and column indices.
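
To make the scipy side concrete, a small illustrative sketch (not part of the 
proposal) of the separated row/column index arrays that the new format would map 
onto; Arrow's existing SparseCOOIndex stores a single (nnz, ndim) indices matrix 
instead, hence the copy today:

{code:python}
import numpy as np
from scipy import sparse

data = np.array([1.0, 2.0, 3.0])
row = np.array([0, 1, 2])
col = np.array([2, 0, 1])
m = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

# scipy keeps the indices as two independent 1-D arrays, which is what the
# proposed "separated indices" COO format could reference without copying.
print(m.row)  # [0 1 2]
print(m.col)  # [2 0 1]
{code}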



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating

2022-08-31 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-15787:
--

Assignee: Rok Mihevc

> [C++] Temporal floor/ceil/round kernels could be optimised with templating
> --
>
> Key: ARROW-15787
> URL: https://issues.apache.org/jira/browse/ARROW-15787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Rok Mihevc
>Priority: Minor
>  Labels: kernel, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [CeilTemporal, FloorTemporal, 
> RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980]
>  kernels could probably be templated in a clean way. They also execute a 
> switch statement for every call instead of creating an operator at kernel 
> call time and only running that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile

2022-08-31 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc resolved ARROW-16147.

Resolution: Fixed

> [C++] ParquetFileWriter doesn't call sink_.Close when using 
> GcsRandomAccessFile
> ---
>
> Key: ARROW-16147
> URL: https://issues.apache.org/jira/browse/ARROW-16147
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: GCP
>
> On parquet::arrow::FileWriter::Close the underlying sink is not closed. The 
> implementation goes to FileSerializer::Close:
> {code:cpp}
> void Close() override {
> if (is_open_) {
>   // If any functions here raise an exception, we set is_open_ to be false
>   // so that this does not get called again (possibly causing segfault)
>   is_open_ = false;
>   if (row_group_writer_) {
> num_rows_ += row_group_writer_->num_rows();
> row_group_writer_->Close();
>   }
>   row_group_writer_.reset();
>   // Write magic bytes and metadata
>   auto file_encryption_properties = 
> properties_->file_encryption_properties();
>   if (file_encryption_properties == nullptr) {  // Non encrypted file.
> file_metadata_ = metadata_->Finish();
> WriteFileMetaData(*file_metadata_, sink_.get());
>   } else {  // Encrypted file
> CloseEncryptedFile(file_encryption_properties);
>   }
> }
>   }
> {code}
> It doesn't call sink_->Close(), which leads to resource leaks and bugs.
> With files (they have their own close() in the destructor) it works fine, but it 
> doesn't work with fs::GcsRandomAccessFile. When I call 
> parquet::arrow::FileWriter::Close, the data is not flushed to storage until the 
> sink stream is closed manually (or the stack space changes).
> Is this intentional, or is it a bug?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile

2022-08-31 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598492#comment-17598492
 ] 

Rok Mihevc commented on ARROW-16147:


I think this was resolved by Micah's linked PR.

> [C++] ParquetFileWriter doesn't call sink_.Close when using 
> GcsRandomAccessFile
> ---
>
> Key: ARROW-16147
> URL: https://issues.apache.org/jira/browse/ARROW-16147
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: GCP
>
> On parquet::arrow::FileWriter::Close the underlying sink is not closed. The 
> implementation goes to FileSerializer::Close:
> {code:cpp}
> void Close() override {
> if (is_open_) {
>   // If any functions here raise an exception, we set is_open_ to be false
>   // so that this does not get called again (possibly causing segfault)
>   is_open_ = false;
>   if (row_group_writer_) {
> num_rows_ += row_group_writer_->num_rows();
> row_group_writer_->Close();
>   }
>   row_group_writer_.reset();
>   // Write magic bytes and metadata
>   auto file_encryption_properties = 
> properties_->file_encryption_properties();
>   if (file_encryption_properties == nullptr) {  // Non encrypted file.
> file_metadata_ = metadata_->Finish();
> WriteFileMetaData(*file_metadata_, sink_.get());
>   } else {  // Encrypted file
> CloseEncryptedFile(file_encryption_properties);
>   }
> }
>   }
> {code}
> It doesn't call sink_->Close(), which leads to resource leaks and bugs.
> With files (they have their own close() in the destructor) it works fine, but it 
> doesn't work with fs::GcsRandomAccessFile. When I call 
> parquet::arrow::FileWriter::Close, the data is not flushed to storage until the 
> sink stream is closed manually (or the stack space changes).
> Is this intentional, or is it a bug?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile

2022-08-31 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc reassigned ARROW-16147:
--

Assignee: Micah Kornfield

> [C++] ParquetFileWriter doesn't call sink_.Close when using 
> GcsRandomAccessFile
> ---
>
> Key: ARROW-16147
> URL: https://issues.apache.org/jira/browse/ARROW-16147
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: GCP
>
> On parquet::arrow::FileWriter::Close the underlying sink is not closed. The 
> implementation goes to FileSerializer::Close:
> {code:cpp}
> void Close() override {
> if (is_open_) {
>   // If any functions here raise an exception, we set is_open_ to be false
>   // so that this does not get called again (possibly causing segfault)
>   is_open_ = false;
>   if (row_group_writer_) {
> num_rows_ += row_group_writer_->num_rows();
> row_group_writer_->Close();
>   }
>   row_group_writer_.reset();
>   // Write magic bytes and metadata
>   auto file_encryption_properties = 
> properties_->file_encryption_properties();
>   if (file_encryption_properties == nullptr) {  // Non encrypted file.
> file_metadata_ = metadata_->Finish();
> WriteFileMetaData(*file_metadata_, sink_.get());
>   } else {  // Encrypted file
> CloseEncryptedFile(file_encryption_properties);
>   }
> }
>   }
> {code}
> It doesn't call sink_->Close(), which leads to resource leaks and bugs.
> With files (they have their own close() in the destructor) it works fine, but it 
> doesn't work with fs::GcsRandomAccessFile. When I call 
> parquet::arrow::FileWriter::Close, the data is not flushed to storage until the 
> sink stream is closed manually (or the stack space changes).
> Is this intentional, or is it a bug?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15011) [R] Generate documentation for dplyr function bindings

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-15011:
---
Labels: pull-request-available  (was: )

> [R] Generate documentation for dplyr function bindings
> --
>
> Key: ARROW-15011
> URL: https://issues.apache.org/jira/browse/ARROW-15011
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We don't want to (re)write the documentation for each binding that exists, 
> but could we use templates or other automated ways of documenting "This 
> binding should work just like X from package Y" when that's true, and then 
> have a place to put some of the exceptions? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15011) [R] Generate documentation for dplyr function bindings

2022-08-31 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-15011:

Summary: [R] Generate documentation for dplyr function bindings  (was: [R] 
Can we (semi?) automatically document when a binding exists)

> [R] Generate documentation for dplyr function bindings
> --
>
> Key: ARROW-15011
> URL: https://issues.apache.org/jira/browse/ARROW-15011
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> We don't want to (re)write the documentation for each binding that exists, 
> but could we use templates or other automated ways of documenting "This 
> binding should work just like X from package Y" when that's true, and then 
> have a place to put some of the exceptions? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15011) [R] Generate documentation for dplyr function bindings

2022-08-31 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-15011:
---

Assignee: Neal Richardson  (was: Dragoș Moldovan-Grünfeld)

> [R] Generate documentation for dplyr function bindings
> --
>
> Key: ARROW-15011
> URL: https://issues.apache.org/jira/browse/ARROW-15011
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Neal Richardson
>Priority: Major
>
> We don't want to (re)write the documentation for each binding that exists, 
> but could we use templates or other automated ways of documenting "This 
> binding should work just like X from package Y" when that's true, and then 
> have a place to put some of the exceptions? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17584) [Go] Unable to build with tinygo

2022-08-31 Thread Tim Schaub (Jira)
Tim Schaub created ARROW-17584:
--

 Summary: [Go] Unable to build with tinygo
 Key: ARROW-17584
 URL: https://issues.apache.org/jira/browse/ARROW-17584
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Tim Schaub


I was hoping to use TinyGo to build WASM binaries with Arrow.  TinyGo can 
generate builds that are [1% the 
size|https://tinygo.org/getting-started/overview/#:~:text=The%20only%20difference%20here%2C%20is,used%2C%20and%20the%20associated%20runtime.&text=In%20this%20case%20the%20Go,size%20(251k%20before%20stripping)!]
 of those generated with Go (significant for applications hosted on the web).

Arrow's use of `reflect.SliceHeader` fields limits the portability of the code. 
 For example, the `Len` and `Cap` fields are assumed to be `int` here: 
https://github.com/apache/arrow/blob/go/v9.0.0/go/arrow/bitutil/bitutil.go#L158-L159

Go's [reflect package 
warns|https://github.com/golang/go/blob/go1.19/src/reflect/value.go#L2675-L2685]
 that the SliceHeader "cannot be used safely or portably and its representation 
may change in a later release."

Attempts to build a WASM binary using the github.com/apache/arrow/go/v10 module 
result in failures like this:
{code}
tinygo build -tags noasm -o test.wasm ./main.go
{code}
{code} 
# github.com/apache/arrow/go/v10/arrow/bitutil
../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:158:10:
 invalid operation: h.Len / uint64SizeBytes (mismatched types uintptr and int)
../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:159:10:
 invalid operation: h.Cap / uint64SizeBytes (mismatched types uintptr and int)
{code}

This happens because TinyGo uses `uintptr` for the corresponding types: 
https://github.com/tinygo-org/tinygo/blob/v0.25.0/src/reflect/value.go#L773-L777

This feels like an issue with TinyGo, and it has been ticketed there multiple 
times (see https://github.com/tinygo-org/tinygo/issues/1284).  They lean on the 
warnings in the Go sources that use of the SliceHeader fields makes code 
unportable and suggest changes to the libraries that do not heed this warning.

I don't have a suggested fix or alternative for Arrow's use of SliceHeader 
fields, but I'm wondering if there would be willingness on the part of this 
package to make WASM builds work with TinyGo.  Perhaps the TinyGo authors could 
also offer suggested changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-08-31 Thread Joost Hoozemans (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598477#comment-17598477
 ] 

Joost Hoozemans commented on ARROW-17583:
-

Thanks for the quick response! This should be a small change; I think I can 
submit something tomorrow.

> [Python] File write visitor throws exception on large parquet file
> --
>
> Key: ARROW-17583
> URL: https://issues.apache.org/jira/browse/ARROW-17583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Joost Hoozemans
>Assignee: Joost Hoozemans
>Priority: Major
>
> When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
> an exception:
> Traceback (most recent call last):
>   File "pyarrow/_dataset_parquet.pyx", line 165, in 
> pyarrow._dataset_parquet.ParquetFileFormat._finish_write
>   File "pyarrow/{_}dataset.pyx", line 2695, in 
> pyarrow._dataset.WrittenFile.{_}{_}init{_}_
> OverflowError: value too large to convert to int
> Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
> The file is written successfully though. It seems related to this issue 
> https://issues.apache.org/jira/browse/ARROW-16761.
> I would guess the problem is the python field is an int while the C++ code 
> returns an int64_t 
> [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-08-31 Thread Joost Hoozemans (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joost Hoozemans reassigned ARROW-17583:
---

Assignee: Joost Hoozemans

> [Python] File write visitor throws exception on large parquet file
> --
>
> Key: ARROW-17583
> URL: https://issues.apache.org/jira/browse/ARROW-17583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Joost Hoozemans
>Assignee: Joost Hoozemans
>Priority: Major
>
> When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
> an exception:
> Traceback (most recent call last):
>   File "pyarrow/_dataset_parquet.pyx", line 165, in 
> pyarrow._dataset_parquet.ParquetFileFormat._finish_write
>   File "pyarrow/{_}dataset.pyx", line 2695, in 
> pyarrow._dataset.WrittenFile.{_}{_}init{_}_
> OverflowError: value too large to convert to int
> Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
> The file is written successfully though. It seems related to this issue 
> https://issues.apache.org/jira/browse/ARROW-16761.
> I would guess the problem is the python field is an int while the C++ code 
> returns an int64_t 
> [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-08-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17583:
---
Priority: Major  (was: Minor)

> [Python] File write visitor throws exception on large parquet file
> --
>
> Key: ARROW-17583
> URL: https://issues.apache.org/jira/browse/ARROW-17583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Joost Hoozemans
>Priority: Major
>
> When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
> an exception:
> Traceback (most recent call last):
>   File "pyarrow/_dataset_parquet.pyx", line 165, in 
> pyarrow._dataset_parquet.ParquetFileFormat._finish_write
>   File "pyarrow/{_}dataset.pyx", line 2695, in 
> pyarrow._dataset.WrittenFile.{_}{_}init{_}_
> OverflowError: value too large to convert to int
> Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
> The file is written successfully though. It seems related to this issue 
> https://issues.apache.org/jira/browse/ARROW-16761.
> I would guess the problem is the python field is an int while the C++ code 
> returns an int64_t 
> [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-08-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598471#comment-17598471
 ] 

Antoine Pitrou commented on ARROW-17583:


Your diagnosis seems right. Would you want to submit a PR?

> [Python] File write visitor throws exception on large parquet file
> --
>
> Key: ARROW-17583
> URL: https://issues.apache.org/jira/browse/ARROW-17583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Joost Hoozemans
>Priority: Minor
>
> When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
> an exception:
> Traceback (most recent call last):
>   File "pyarrow/_dataset_parquet.pyx", line 165, in 
> pyarrow._dataset_parquet.ParquetFileFormat._finish_write
>   File "pyarrow/{_}dataset.pyx", line 2695, in 
> pyarrow._dataset.WrittenFile.{_}{_}init{_}_
> OverflowError: value too large to convert to int
> Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
> The file is written successfully though. It seems related to this issue 
> https://issues.apache.org/jira/browse/ARROW-16761.
> I would guess the problem is the python field is an int while the C++ code 
> returns an int64_t 
> [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-08-31 Thread Joost Hoozemans (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joost Hoozemans updated ARROW-17583:

Description: 
When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
an exception:

Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in 
pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/{_}dataset.pyx", line 2695, in 
pyarrow._dataset.WrittenFile.{_}{_}init{_}_
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'

The file is written successfully though. It seems related to this issue 
https://issues.apache.org/jira/browse/ARROW-16761.

I would guess the problem is the python field is an int while the C++ code 
returns an int64_t 
[https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
 

  was:
When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
an exception:

Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in 
pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/_dataset.pyx", line 2695, in 
pyarrow._dataset.WrittenFile.__init__
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'

The file is written successfully though. It seems related to this issue 
https://issues.apache.org/jira/browse/ARROW-16761.

I would guess the problem is the python field is an int while the C++ code 
return an int64_t 
[https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
 


> [Python] File write visitor throws exception on large parquet file
> --
>
> Key: ARROW-17583
> URL: https://issues.apache.org/jira/browse/ARROW-17583
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: Joost Hoozemans
>Priority: Minor
>
> When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
> an exception:
> Traceback (most recent call last):
>   File "pyarrow/_dataset_parquet.pyx", line 165, in 
> pyarrow._dataset_parquet.ParquetFileFormat._finish_write
>   File "pyarrow/{_}dataset.pyx", line 2695, in 
> pyarrow._dataset.WrittenFile.{_}{_}init{_}_
> OverflowError: value too large to convert to int
> Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
> The file is written successfully though. It seems related to this issue 
> https://issues.apache.org/jira/browse/ARROW-16761.
> I would guess the problem is the python field is an int while the C++ code 
> returns an int64_t 
> [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17583) [Python] File write visitor throws exception on large parquet file

2022-08-31 Thread Joost Hoozemans (Jira)
Joost Hoozemans created ARROW-17583:
---

 Summary: [Python] File write visitor throws exception on large 
parquet file
 Key: ARROW-17583
 URL: https://issues.apache.org/jira/browse/ARROW-17583
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 9.0.0
Reporter: Joost Hoozemans


When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws 
an exception:

Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in 
pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/_dataset.pyx", line 2695, in 
pyarrow._dataset.WrittenFile.__init__
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'

The file is written successfully though. It seems related to this issue 
https://issues.apache.org/jira/browse/ARROW-16761.

I would guess the problem is the python field is an int while the C++ code 
returns an int64_t 
[https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164]
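
A quick way to see why a plain C {{int}} is the likely culprit (a minimal sketch only, assuming the written file's size is reported in bytes as an int64 on the C++ side; the real field lives in the Cython sources linked above): the size of a ~5 GB file simply does not fit in 32 bits, which matches the reported OverflowError.
{code}
# Illustrative sketch, not pyarrow code: a ~5 GiB size exceeds the 32-bit
# signed int range, so converting it to a C `int` overflows.
INT32_MAX = 2**31 - 1
size_bytes = 5 * 1024**3            # ~5 GiB parquet file, as in the report
print(size_bytes > INT32_MAX)       # True -> "value too large to convert to int"
{code}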
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-17557) [Go] WASM build fails

2022-08-31 Thread Tim Schaub (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Schaub closed ARROW-17557.
--
Resolution: Fixed

The wasm build works with `-tags noasm` and either 
github.com/apache/thrift@v0.16.0 or arrow/parquet v10.
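
For reference, combining the command from the original report with the build tag mentioned above gives the following (the output name and package path are just the ones used in the issue):
{code}
GOOS=js GOARCH=wasm go build -tags noasm -o test.wasm ./main.go
{code}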

> [Go] WASM build fails
> -
>
> Key: ARROW-17557
> URL: https://issues.apache.org/jira/browse/ARROW-17557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Tim Schaub
>Priority: Major
>
> I see ARROW-4689 and it looks like 
> [https://github.com/apache/arrow/pull/3707] was supposed to add support for 
> building with {{GOOS=js GOARCH=wasm}}.
> When I try to build a wasm binary, I get the following failure
> {code}
> # GOOS=js GOARCH=wasm go build -o test.wasm ./main.go
> # github.com/apache/arrow/go/v9/internal/utils
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:76:4:
>  undefined: TransposeInt8Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:78:4:
>  undefined: TransposeInt8Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:80:4:
>  undefined: TransposeInt8Int32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:82:4:
>  undefined: TransposeInt8Int64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:84:4:
>  undefined: TransposeInt8Uint8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:86:4:
>  undefined: TransposeInt8Uint16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:88:4:
>  undefined: TransposeInt8Uint32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:90:4:
>  undefined: TransposeInt8Uint64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:95:4:
>  undefined: TransposeInt16Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  undefined: TransposeInt16Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  too many errors
> # github.com/apache/thrift/lib/go/thrift
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:63:
>  undefined: syscall.MSG_PEEK
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:80:
>  undefined: syscall.MSG_DONTWAIT
> {code}
> {code}
> go version go1.18.2 darwin/arm64
> {code}
> {code}
> github.com/apache/arrow/go/v9 v9.0.0
> {code}
> Does additional code need to be generated for the {{wasm}} arch?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17557) [Go] WASM build fails

2022-08-31 Thread Tim Schaub (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598465#comment-17598465
 ] 

Tim Schaub commented on ARROW-17557:


It looks like `-tags noasm` was one part of the issue.  The other issue with 
thrift@0.15.0 was addressed in https://github.com/apache/thrift/pull/2455.  I 
updated the arrow and parquet packages to v10, and the build works now.

Thank you for the help.

> [Go] WASM build fails
> -
>
> Key: ARROW-17557
> URL: https://issues.apache.org/jira/browse/ARROW-17557
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Tim Schaub
>Priority: Major
>
> I see ARROW-4689 and it looks like 
> [https://github.com/apache/arrow/pull/3707] was supposed to add support for 
> building with {{GOOS=js GOARCH=wasm}}.
> When I try to build a wasm binary, I get the following failure
> {code}
> # GOOS=js GOARCH=wasm go build -o test.wasm ./main.go
> # github.com/apache/arrow/go/v9/internal/utils
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:76:4:
>  undefined: TransposeInt8Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:78:4:
>  undefined: TransposeInt8Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:80:4:
>  undefined: TransposeInt8Int32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:82:4:
>  undefined: TransposeInt8Int64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:84:4:
>  undefined: TransposeInt8Uint8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:86:4:
>  undefined: TransposeInt8Uint16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:88:4:
>  undefined: TransposeInt8Uint32
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:90:4:
>  undefined: TransposeInt8Uint64
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:95:4:
>  undefined: TransposeInt16Int8
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  undefined: TransposeInt16Int16
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4:
>  too many errors
> # github.com/apache/thrift/lib/go/thrift
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:63:
>  undefined: syscall.MSG_PEEK
> ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:80:
>  undefined: syscall.MSG_DONTWAIT
> {code}
> {code}
> go version go1.18.2 darwin/arm64
> {code}
> {code}
> github.com/apache/arrow/go/v9 v9.0.0
> {code}
> Does additional code need to be generated for the {{wasm}} arch?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak

2022-08-31 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol reassigned ARROW-17573:
-

Assignee: Matthew Topol

> [Go] Parquet ByteArray statistics cause memory leak
> ---
>
> Key: ARROW-17573
> URL: https://issues.apache.org/jira/browse/ARROW-17573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Parquet
>Affects Versions: 9.0.0
>Reporter: Sasha Sirovica
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to parquet the memory of the program 
> continuously increases. This also applies for the other `arrow.BinaryTypes` 
>  
> I took a heap dump midway through the program and the majority of allocations 
> comes from `StringBuilder.Append` which is not GC'd. I approached 16GB of RAM 
> before terminating the program.
>  
> I was not able to replicate this behavior with just PrimitiveTypes. Another 
> interesting point, if the records are created but never written with pqarrow 
> memory does not grow. In the below program commenting out `w.Write(rec)` will 
> not cause memory issues.
> Example program which causes memory to leak:
> {code:java}
> package main
> import (
>"os"
>"github.com/apache/arrow/go/v9/arrow"
>"github.com/apache/arrow/go/v9/arrow/array"
>"github.com/apache/arrow/go/v9/arrow/memory"
>"github.com/apache/arrow/go/v9/parquet"
>"github.com/apache/arrow/go/v9/parquet/compress"
>"github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> func main() {
>f, _ := os.Create("/tmp/test.parquet")
>arrowProps := pqarrow.DefaultWriterProps()
>schema := arrow.NewSchema(
>   []arrow.Field{
>  {Name: "aString", Type: arrow.BinaryTypes.String},
>   },
>   nil,
>)
>w, _ := pqarrow.NewFileWriter(schema, f, 
> parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
> arrowProps)
>builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>for i := 1; i < 50; i++ {
>   builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
>   if i%200 == 0 {
>  // Write row groups out every 2M times
>  rec := builder.NewRecord()
>  w.Write(rec)
>  rec.Release()
>   }
>}
>w.Close()
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17569) [C++] Bump xsimd to 9.0.0

2022-08-31 Thread Bernhard Manfred Gruber (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598455#comment-17598455
 ] 

Bernhard Manfred Gruber commented on ARROW-17569:
-

Hi! I tried to push the update of xsimd to vcpkg and noticed that with the 
change arrow is failing to build: 
https://github.com/microsoft/vcpkg/pull/26501. Good to see that this is being 
worked on.

> [C++] Bump xsimd to 9.0.0
> -
>
> Key: ARROW-17569
> URL: https://issues.apache.org/jira/browse/ARROW-17569
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Serge Guelton
>Assignee: Serge Guelton
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> xsimd has released a new upstream version (namely 9.0.0); it would be nice to 
> match it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17582) Relax / extend type checking for pyarrow array creation

2022-08-31 Thread Gil Forsyth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Forsyth updated ARROW-17582:

Description: 
in [ibis|https://github.com/ibis-project/ibis] we're interested in offering 
query results as a record batch – some of the data we're starting with is 
coming back from a {{sqlalchemy.cursor}} which _look_ like {{{}tuple{}}}s and 
{{{}dict{}}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and 
{{{}sqlalchemy.engine.row.RowMapping{}}}, respectively.

 

The checks in {{python_to_arrow.cc}} are strict enough that these can't be 
readily dumped into an {{array}} without first calling, e.g. {{tuple}} on the 
individual rows of the results.

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
vs
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]{code}
To avoid the overhead of this extra conversion, maybe there are some checks 
that aren't explicit python type-checks that we can rely on?

  was:
in [ibis|https://github.com/ibis-project/ibis] we're interested in offering 
query results as a record batch – some of the data we're starting with is 
coming back from a {{sqlalchemy.cursor}} which _look_ like {{{}tuple{}}}s and 
{{{}dict{}}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and 
{{{}sqlalchemy.engine.row.RowMapping{}}}, respectively.

 

The checks in `python_to_arrow.cc` are strict enough that these can't be 
readily dumped into an `array` without first calling, e.g. `tuple` on the 
individual rows of the results.

 

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
vs
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1

[jira] [Updated] (ARROW-17582) Relax / extend type checking for pyarrow array creation

2022-08-31 Thread Gil Forsyth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Forsyth updated ARROW-17582:

Description: 
in [ibis|https://github.com/ibis-project/ibis] we're interested in offering 
query results as a record batch – some of the data we're starting with is 
coming back from a {{sqlalchemy.cursor}} which _look_ like {{{}tuple{}}}s and 
{{{}dict{}}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and 
{{{}sqlalchemy.engine.row.RowMapping{}}}, respectively.

 

The checks in `python_to_arrow.cc` are strict enough that these can't be 
readily dumped into an `array` without first calling, e.g. `tuple` on the 
individual rows of the results.

 

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
vs
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]{code}
To avoid the overhead of this extra conversion, maybe there are some checks 
that aren't explicit python type-checks that we can rely on?

  was:
in [ibis|https://github.com/ibis-project/ibis] we're interested in offering 
query results as a record batch – some of the data we're starting with is 
coming back from a `sqlalchemy.cursor` which _look_ like `tuple`s and `dict`s 
but are actually `sqlalchemy.engine.row.LegacyRow` and 
`sqlalchemy.engine.row.RowMapping`, respectively.

 

The checks in `python_to_arrow.cc` are strict enough that these can't be 
readily dumped into an `array` without first calling, e.g. `tuple` on the 
individual rows of the results.

 

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
vs
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
  

[jira] [Updated] (ARROW-17582) Relax / extend type checking for pyarrow array creation

2022-08-31 Thread Gil Forsyth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Forsyth updated ARROW-17582:

Description: 
in [ibis|https://github.com/ibis-project/ibis] we're interested in offering 
query results as a record batch – some of the data we're starting with is 
coming back from a `sqlalchemy.cursor` which _look_ like `tuple`s and `dict`s 
but are actually `sqlalchemy.engine.row.LegacyRow` and 
`sqlalchemy.engine.row.RowMapping`, respectively.

 

The checks in `python_to_arrow.cc` are strict enough that these can't be 
readily dumped into an `array` without first calling, e.g. `tuple` on the 
individual rows of the results.

 

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
vs
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]{code}
To avoid the overhead of this extra conversion, maybe there are some checks 
that aren't explicit python type-checks that we can rely on?

  was:
in ibis we're interested in offering query results as a record batch – some of 
the data we're starting with is coming back from a `sqlalchemy.cursor` which 
_look_ like `tuple`s and `dict`s but are actually 
`sqlalchemy.engine.row.LegacyRow` and `sqlalchemy.engine.row.RowMapping`, 
respectively.

 

The checks in `python_to_arrow.cc` are strict enough that these can't be 
readily dumped into an `array` without first calling, e.g. `tuple` on the 
individual rows of the results.

 

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
 

 

vs

 
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,

[jira] [Created] (ARROW-17582) Relax / extend type checking for pyarrow array creation

2022-08-31 Thread Gil Forsyth (Jira)
Gil Forsyth created ARROW-17582:
---

 Summary: Relax / extend type checking for pyarrow array creation
 Key: ARROW-17582
 URL: https://issues.apache.org/jira/browse/ARROW-17582
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Gil Forsyth


in ibis we're interested in offering query results as a record batch – some of 
the data we're starting with is coming back from a `sqlalchemy.cursor` which 
_look_ like `tuple`s and `dict`s but are actually 
`sqlalchemy.engine.row.LegacyRow` and `sqlalchemy.engine.row.RowMapping`, 
respectively.

 

The checks in `python_to_arrow.cc` are strict enough that these can't be 
readily dumped into an `array` without first calling, e.g. `tuple` on the 
individual rows of the results.

 

 
{code:java}
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]
In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", 
pa.int32())])
In [170]: pa.array(batch[:5], type=pa_schema)
---
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in ()
> 1 pa.array(batch[:5], type=pa_schema)
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317,
 in pyarrow.lib.array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39,
 in pyarrow.lib._sequence_to_array()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144,
 in pyarrow.lib.pyarrow_internal_check_status()
File 
/nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123,
 in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, 
value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  
GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  
InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, 
static_cast(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  
converter->Extend(seq, size)
{code}
 

 

vs

 
{code:java}
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]: 

-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]{code}


To avoid the overhead of this extra conversion, maybe there are some checks 
that aren't explicit python type-checks that we can rely on?
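
One possible shape for such a relaxed check (purely an illustration of the idea, not the actual logic in python_to_arrow.cc; the helper name below is a stand-in): test for sequence/mapping behaviour via the ABCs instead of requiring exact `tuple`/`dict` instances, so Row-like objects pass.
{code}
# Sketch only; acceptable_struct_item is hypothetical and not part of pyarrow.
from collections.abc import Mapping, Sequence

def acceptable_struct_item(obj):
    """Accept dict-like or tuple-like rows, e.g. sqlalchemy Row objects."""
    if isinstance(obj, Mapping):
        return True
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))

print(acceptable_struct_item((1, 2173)))                             # True
print(acceptable_struct_item({"l_orderkey": 1, "l_partkey": 2173}))  # True
print(acceptable_struct_item("not a row"))                           # False
{code}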



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17573:
---
Labels: pull-request-available  (was: )

> [Go] Parquet ByteArray statistics cause memory leak
> ---
>
> Key: ARROW-17573
> URL: https://issues.apache.org/jira/browse/ARROW-17573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Parquet
>Affects Versions: 9.0.0
>Reporter: Sasha Sirovica
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to parquet the memory of the program 
> continuously increases. This also applies for the other `arrow.BinaryTypes` 
>  
> I took a heap dump midway through the program and the majority of allocations 
> comes from `StringBuilder.Append` which is not GC'd. I approached 16GB of RAM 
> before terminating the program.
>  
> I was not able to replicate this behavior with just PrimitiveTypes. Another 
> interesting point, if the records are created but never written with pqarrow 
> memory does not grow. In the below program commenting out `w.Write(rec)` will 
> not cause memory issues.
> Example program which causes memory to leak:
> {code:java}
> package main
> import (
>"os"
>"github.com/apache/arrow/go/v9/arrow"
>"github.com/apache/arrow/go/v9/arrow/array"
>"github.com/apache/arrow/go/v9/arrow/memory"
>"github.com/apache/arrow/go/v9/parquet"
>"github.com/apache/arrow/go/v9/parquet/compress"
>"github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> func main() {
>f, _ := os.Create("/tmp/test.parquet")
>arrowProps := pqarrow.DefaultWriterProps()
>schema := arrow.NewSchema(
>   []arrow.Field{
>  {Name: "aString", Type: arrow.BinaryTypes.String},
>   },
>   nil,
>)
>w, _ := pqarrow.NewFileWriter(schema, f, 
> parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
> arrowProps)
>builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>for i := 1; i < 50; i++ {
>   builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
>   if i%200 == 0 {
>  // Write row groups out every 2M times
>  rec := builder.NewRecord()
>  w.Write(rec)
>  rec.Release()
>   }
>}
>w.Close()
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak

2022-08-31 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-17573:
--
Component/s: Parquet

> [Go] Parquet ByteArray statistics cause memory leak
> ---
>
> Key: ARROW-17573
> URL: https://issues.apache.org/jira/browse/ARROW-17573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go, Parquet
>Affects Versions: 9.0.0
>Reporter: Sasha Sirovica
>Priority: Major
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to parquet the memory of the program 
> continuously increases. This also applies for the other `arrow.BinaryTypes` 
>  
> I took a heap dump midway through the program and the majority of allocations 
> comes from `StringBuilder.Append` which is not GC'd. I approached 16GB of RAM 
> before terminating the program.
>  
> I was not able to replicate this behavior with just PrimitiveTypes. Another 
> interesting point, if the records are created but never written with pqarrow 
> memory does not grow. In the below program commenting out `w.Write(rec)` will 
> not cause memory issues.
> Example program which causes memory to leak:
> {code:java}
> package main
> import (
>"os"
>"github.com/apache/arrow/go/v9/arrow"
>"github.com/apache/arrow/go/v9/arrow/array"
>"github.com/apache/arrow/go/v9/arrow/memory"
>"github.com/apache/arrow/go/v9/parquet"
>"github.com/apache/arrow/go/v9/parquet/compress"
>"github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> func main() {
>f, _ := os.Create("/tmp/test.parquet")
>arrowProps := pqarrow.DefaultWriterProps()
>schema := arrow.NewSchema(
>   []arrow.Field{
>  {Name: "aString", Type: arrow.BinaryTypes.String},
>   },
>   nil,
>)
>w, _ := pqarrow.NewFileWriter(schema, f, 
> parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
> arrowProps)
>builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>for i := 1; i < 50; i++ {
>   builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
>   if i%200 == 0 {
>  // Write row groups out every 2M times
>  rec := builder.NewRecord()
>  w.Write(rec)
>  rec.Release()
>   }
>}
>w.Close()
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak

2022-08-31 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol updated ARROW-17573:
--
Summary: [Go] Parquet ByteArray statistics cause memory leak  (was: [Go] 
String Binary Builder Leaks Memory When Writing to Parquet)

> [Go] Parquet ByteArray statistics cause memory leak
> ---
>
> Key: ARROW-17573
> URL: https://issues.apache.org/jira/browse/ARROW-17573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Affects Versions: 9.0.0
>Reporter: Sasha Sirovica
>Priority: Major
>
> When using `arrow.BinaryTypes.String` in a schema, appending multiple 
> strings, and then writing a record out to parquet the memory of the program 
> continuously increases. This also applies for the other `arrow.BinaryTypes` 
>  
> I took a heap dump midway through the program and the majority of allocations 
> comes from `StringBuilder.Append` which is not GC'd. I approached 16GB of RAM 
> before terminating the program.
>  
> I was not able to replicate this behavior with just PrimitiveTypes. Another 
> interesting point, if the records are created but never written with pqarrow 
> memory does not grow. In the below program commenting out `w.Write(rec)` will 
> not cause memory issues.
> Example program which causes memory to leak:
> {code:java}
> package main
> import (
>"os"
>"github.com/apache/arrow/go/v9/arrow"
>"github.com/apache/arrow/go/v9/arrow/array"
>"github.com/apache/arrow/go/v9/arrow/memory"
>"github.com/apache/arrow/go/v9/parquet"
>"github.com/apache/arrow/go/v9/parquet/compress"
>"github.com/apache/arrow/go/v9/parquet/pqarrow"
> )
> func main() {
>f, _ := os.Create("/tmp/test.parquet")
>arrowProps := pqarrow.DefaultWriterProps()
>schema := arrow.NewSchema(
>   []arrow.Field{
>  {Name: "aString", Type: arrow.BinaryTypes.String},
>   },
>   nil,
>)
>w, _ := pqarrow.NewFileWriter(schema, f, 
> parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), 
> arrowProps)
>builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
>for i := 1; i < 50; i++ {
>   builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
>   if i%200 == 0 {
>  // Write row groups out every 2M times
>  rec := builder.NewRecord()
>  w.Write(rec)
>  rec.Release()
>   }
>}
>w.Close()
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17581) [R] Refactor build_expr and eval_array_expression to remove special casing

2022-08-31 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-17581:
---

 Summary: [R] Refactor build_expr and eval_array_expression to 
remove special casing
 Key: ARROW-17581
 URL: https://issues.apache.org/jira/browse/ARROW-17581
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 10.0.0


As [~paleolimbot] observes 
[here|https://github.com/apache/arrow/pull/13985#discussion_r957286453], we 
should avoid adding additional complexity or indirection in how 
expressions/bindings are defined--it's complex enough as is. We have helper 
functions {{build_expr}} (used with Acero, wrapper around Expression$create, 
returns Expression) and {{eval_array_expression}} (for eager computation on 
Arrays, wrapper around call_function) that wrap input arguments as Scalars or 
whatever, but they also do some special casing for functions that need custom 
handling. 

However, since those functions were initially written, we've developed other 
ways to handle these special cases more explicitly, and not all operations pass 
through these helper functions. We should pull out the special cases and define 
those functions/bindings explicitly and only use these helpers in the simple 
case where no extra logic is required.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16728) [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset

2022-08-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598414#comment-17598414
 ] 

Raúl Cumplido commented on ARROW-16728:
---

[~jorisvandenbossche] was your idea for this one to be done on two different 
releases?
First - DeprecationWarning and switch default to use_legacy_dataset=False

Second - Remove possibility of using use_legacy_dataset=True
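
A rough sketch of what the first step could look like on the Python side (illustrative only; the helper below is hypothetical and the real keyword handling lives in ParquetDataset, so details may differ):
{code}
# Hypothetical helper, not pyarrow code: warn when the legacy path is
# requested and make the new datasets-based implementation the default.
import warnings

def resolve_use_legacy_dataset(use_legacy_dataset=None):
    if use_legacy_dataset:
        warnings.warn(
            "use_legacy_dataset=True is deprecated and will be removed "
            "in a future release",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False  # new default: use the datasets implementation

print(resolve_use_legacy_dataset())      # False, no warning
print(resolve_use_legacy_dataset(True))  # True, emits FutureWarning
{code}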

> [Python] Switch default and deprecate use_legacy_dataset=True in 
> ParquetDataset
> ---
>
> Key: ARROW-16728
> URL: https://issues.apache.org/jira/browse/ARROW-16728
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The ParquetDataset() constructor itself still defaults to 
> {{use_legacy_dataset=True}} (although using specific attributes or keywords 
> related to that will raise a warning). So a next step will be to actually 
> deprecate passing that and switching the default, and then only afterwards we 
> can remove the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-15691) [Dev] Update archery to work with either master or main as default branch

2022-08-31 Thread Fiona La (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fiona La reassigned ARROW-15691:


Assignee: Fiona La

> [Dev] Update archery to work with either master or main as default branch
> -
>
> Key: ARROW-15691
> URL: https://issues.apache.org/jira/browse/ARROW-15691
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Neal Richardson
>Assignee: Fiona La
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598383#comment-17598383
 ] 

Arthur Passos edited comment on ARROW-17459 at 8/31/22 2:31 PM:


[~emkornfield] if I understand correctly, this could help with the original 
case I shared. In the case [~willjones127] shared, where he creates a 
ChunkedArray and then serializes it, it wouldn't help. Is that correct?

I am stating this based on my current understanding of the inner workings of 
`arrow`: The ChunkedArray data structure will be used in two or more 
situations: 
1. The data in a row group exceeds the limit of INT_MAX (Case I initially 
shared)
2. The serialized data/ table is a chunked array, thus it makes sense to use a 
chunked array.

 

edit:

I have just tested the snippet shared by Will Jones using `type = 
pa.map_(pa.large_string(), pa.int64())` instead of `type = pa.map_(pa.string(), 
pa.int32())` and the issue persists. 
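
For readers skimming the thread, the substitution being tested is just the 64-bit-offset variant of the same map column (the snippet below only shows the two types; the full reproduction is the script Will Jones shared earlier in the thread):
{code}
import pyarrow as pa

original_type = pa.map_(pa.string(), pa.int32())        # type from the shared snippet
tested_type = pa.map_(pa.large_string(), pa.int64())    # 64-bit variant, same failure
print(original_type)
print(tested_type)
{code}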

 


was (Author: JIRAUSER294600):
[~emkornfield] if I understand correctly, this could help with the original 
case I shared. In the case [~willjones127] shared, where he creates a 
ChunkedArray and then serializes it, it wouldn't help. Is that correct?

I am stating this based on my current understanding of the inner workings of 
`arrow`: The ChunkedArray data structure will be used in two or more 
situations: 
1. The data in a row group exceeds the limit of INT_MAX (Case I initially 
shared)
2. The serialized data/ table is a chunked array, thus it makes sense to use a 
chunked array.

 

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

2022-08-31 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598406#comment-17598406
 ] 

Weston Pace commented on ARROW-17541:
-

> Would a more precise way to say this be that there is some shared pointer 
> (potentially held by an R6 object that is still in scope and not being 
> destroyed) that is keeping the record batches from being freed? We do have an 
> R reference to the exec plan and the final node of the exec plan (which would 
> be the penultimate node in the dataset write, which is probably the scan 
> node). (It still makes no sense to me why the batches aren't getting 
> released).

Yes, I think that is a more precise way.  Holding onto the ExecPlan (which owns 
the nodes too) should be ok (indeed, desirable).

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> 
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Carl Boettiger
>Priority: Critical
> Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · 
> Plotly Chart Studio.png
>
>
> Consider the following example of opening a remote dataset (a single 4 GB 
> parquet file) and streaming it to disk. Consider this reprex:
>  
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", 
> anonymous=TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at about ~10 GB RAM use, which is already 
> surprisingly high (when the whole file is 4 GB when on disk), but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in 8.0 release?  Is 
> there something about this particular parquet file that should be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. single 4GB parquet file), as R users we really depend on arrow's 
> out-of-band operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

2022-08-31 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-17541:

Priority: Critical  (was: Major)

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> 
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Carl Boettiger
>Priority: Critical
> Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · 
> Plotly Chart Studio.png
>
>
> Consider the following example of opening a remote dataset (a single 4 GB 
> parquet file) and streaming it to disk. Consider this reprex:
>  
> {code:java}
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", 
> anonymous=TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at about ~10 GB RAM use, which is already 
> surprisingly high (when the whole file is 4 GB when on disk), but on arrow 
> 9.0.0 RAM use for the same operation approximately doubles, which is large 
> enough to trigger the OOM killer on the task in several of our active 
> production workflows. 
>  
> Can this large RAM use increase introduced in 9.0 be avoided?  Is it possible 
> for this operation to use even less RAM than it does in 8.0 release?  Is 
> there something about this particular parquet file that should be responsible 
> for the large RAM use? 
>  
> Arrow's impressively fast performance on large data on remote hosts is really 
> game-changing for us.  Still, the OOM errors are a bit unexpected at this 
> scale (i.e. single 4GB parquet file), as R users we really depend on arrow's 
> out-of-band operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?

2022-08-31 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598400#comment-17598400
 ] 

Alenka Frim commented on ARROW-17579:
-

Oh, got it. Not sure I can help but will for sure dig into it!

> [Python] PYARROW_CXXFLAGS ignored?
> --
>
> Key: ARROW-17579
> URL: https://issues.apache.org/jira/browse/ARROW-17579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> In {{setup.py}}, I see that the {{PYARROW_CXXFLAGS}} environment variable is 
> read, but its value then seems to be ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?

2022-08-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598393#comment-17598393
 ] 

Antoine Pitrou commented on ARROW-17579:


Even if I try to pass {{PYARROW_CXXFLAGS}} directly to CMake, it seems to be 
used for linking but not compiling, as reported in ARROW-17580.

> [Python] PYARROW_CXXFLAGS ignored?
> --
>
> Key: ARROW-17579
> URL: https://issues.apache.org/jira/browse/ARROW-17579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> In {{setup.py}}, I see that the {{PYARROW_CXXFLAGS}} environment variable is 
> read, but its value then seems to be ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17568) [FlightRPC][Integration] Ensure all RPC methods are covered by integration testing

2022-08-31 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598392#comment-17598392
 ] 

David Li commented on ARROW-17568:
--

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/integration_tests/test_integration.cc
 and 
https://github.com/apache/arrow/blob/cf27001da088d882a7d460cddd84a0202f3d8eba/dev/archery/archery/integration/runner.py#L424-L438

> [FlightRPC][Integration] Ensure all RPC methods are covered by integration 
> testing
> --
>
> Key: ARROW-17568
> URL: https://issues.apache.org/jira/browse/ARROW-17568
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Go, Integration, Java
>Reporter: David Li
>Priority: Major
>
> This would help catch issues like https://github.com/apache/arrow/issues/13853



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?

2022-08-31 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598391#comment-17598391
 ] 

Alenka Frim commented on ARROW-17579:
-

I guess so. I think {{self.cmake_cxxflags}} should be added to 
{{cmake_options}} in {{_run_cmake_pyarrow_cpp}} and {{_run_cmake}} so it is read by 
CMake when building.
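
Sketched out, the proposal above amounts to something like the following (a rough sketch only: apart from the PYARROW_CXXFLAGS name quoted above, the function and option names are stand-ins, not the actual setup.py code):
{code}
# Illustrative only: forward PYARROW_CXXFLAGS into the CMake invocation so it
# reaches compiler calls rather than only the linker.
import os

def build_cmake_options(extra_cxxflags=None):
    options = ["-DCMAKE_BUILD_TYPE=release"]        # stand-in for existing options
    cxxflags = extra_cxxflags or os.environ.get("PYARROW_CXXFLAGS", "")
    if cxxflags:
        options.append("-DPYARROW_CXXFLAGS=" + cxxflags)
    return options

print(build_cmake_options("-march=native -g"))
{code}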

> [Python] PYARROW_CXXFLAGS ignored?
> --
>
> Key: ARROW-17579
> URL: https://issues.apache.org/jira/browse/ARROW-17579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> In {{setup.py}}, I see that the {{PYARROW_CXXFLAGS}} environment variable is 
> read, but its value then seems to be ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598383#comment-17598383
 ] 

Arthur Passos commented on ARROW-17459:
---

[~emkornfield] if I understand correctly, this could help with the original 
case I shared. In the case [~willjones127] shared, where he creates a 
ChunkedArray and then serializes it, it wouldn't help. Is that correct?

I am stating this based on my current understanding of the inner workings of 
`arrow`: The ChunkedArray data structure will be used in two or more 
situations: 
1. The data in a row group exceeds the limit of INT_MAX (Case I initially 
shared)
2. The serialized data/ table is a chunked array, thus it makes sense to use a 
chunked array.

 

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags

2022-08-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598375#comment-17598375
 ] 

Antoine Pitrou commented on ARROW-17580:


Another problem seems to be that PyArrow uses two independent CMakeLists files 
that are not related to each other (meaning two different sets of CMake 
invocations, with different possible options...).

> [Doc][C++][Python] Unclear how to influence compilation flags
> -
>
> Key: ARROW-17580
> URL: https://issues.apache.org/jira/browse/ARROW-17580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Antoine Pitrou
>Priority: Critical
>
> Frequently people need to customize compilation flags for C++ and/or C files.
> Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find 
> out the proper way to do this.
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while 
> the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).
> For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment 
> variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two 
> problems:
> * it is only recognized for Cython-generated files, not for PyArrow C++ 
> sources
> * it only affects linker calls, while it should actually affect compiler calls



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags

2022-08-31 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17580:
--

 Summary: [Doc][C++][Python] Unclear how to influence compilation 
flags
 Key: ARROW-17580
 URL: https://issues.apache.org/jira/browse/ARROW-17580
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation, Python
Reporter: Antoine Pitrou


Frequently people need to customize compilation flags for C++ and/or C files.

Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find out 
the proper way to do this.

For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the 
{{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).

For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment 
variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two problems:
* it is only recognized for Cython-generated files, not for PyArrow C++ sources
* it only affects linker calls, while it should actually affect compiler calls




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags

2022-08-31 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598372#comment-17598372
 ] 

Antoine Pitrou commented on ARROW-17580:


cc [~alenka] [~kou]

> [Doc][C++][Python] Unclear how to influence compilation flags
> -
>
> Key: ARROW-17580
> URL: https://issues.apache.org/jira/browse/ARROW-17580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Antoine Pitrou
>Priority: Critical
>
> Frequently people need to customize compilation flags for C++ and/or C files.
> Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find 
> out the proper way to do this.
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while 
> the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).
> For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment is 
> ignored, and the {{PYARROW_CXXFLAGS}} has two problems:
> * it is only recognized for Cython-generated files, not for PyArrow C++ 
> sources
> * it only affects linker calls, while it should actually affect compiler calls



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags

2022-08-31 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-17580:
---
Description: 
Frequently people need to customize compilation flags for C++ and/or C files.

Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find out 
the proper way to do this.

For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the 
{{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).

For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment 
variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two 
problems:
* it is only recognized for Cython-generated files, not for PyArrow C++ sources
* it only affects linker calls, while it should actually affect compiler calls


  was:
Frequently people need to customize compilation flags for C++ and/or C files.

Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find out 
the proper way to do this.

For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the 
{{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).

For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment is 
ignored, and the {{PYARROW_CXXFLAGS}} has two problems:
* it is only recognized for Cython-generated files, not for PyArrow C++ sources
* it only affects linker calls, while it should actually affect compiler calls



> [Doc][C++][Python] Unclear how to influence compilation flags
> -
>
> Key: ARROW-17580
> URL: https://issues.apache.org/jira/browse/ARROW-17580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Antoine Pitrou
>Priority: Critical
>
> Frequently people need to customize compilation flags for C++ and/or C files.
> Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find 
> out the proper way to do this.
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while 
> the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?).
> For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment 
> variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two 
> problems:
> * it is only recognized for Cython-generated files, not for PyArrow C++ 
> sources
> * it only affects linker calls, while it should actually affect compiler calls



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?

2022-08-31 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17579:
--

 Summary: [Python] PYARROW_CXXFLAGS ignored?
 Key: ARROW-17579
 URL: https://issues.apache.org/jira/browse/ARROW-17579
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Antoine Pitrou


In {{setup.py}}, I see that the {{PYARROW_CXXFLAGS}} environment variable is 
read, but its value then seems to be ignored.
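
A minimal sketch of how one would try to use it (the flag value is an arbitrary marker, purely illustrative):

{code:bash}
# Export the variable that setup.py reads, then rebuild; per this report the
# value does not appear to reach the compiler invocations.
export PYARROW_CXXFLAGS="-DPYARROW_MARKER=1"   # marker flag is hypothetical
python setup.py build_ext --inplace
{code}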



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation

2022-08-31 Thread Alexey Smirnov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Smirnov reassigned ARROW-17330:
--

Assignee: (was: Alexey Smirnov)

> [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array 
> concatenation
> ---
>
> Key: ARROW-17330
> URL: https://issues.apache.org/jira/browse/ARROW-17330
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Affects Versions: 8.0.0
>Reporter: Alexey Smirnov
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Extend ArrowBuffer.BitmapBuilder with an Append method overload that takes a 
> ReadOnlySpan parameter. 
> This allows adding validity bits to the builder more efficiently (especially 
> for cases where the initial validity bits are added to a newly created, empty 
> builder). Moreover, it makes the BitmapBuilder API more consistent (for 
> example, ArrowBuffer.Builder already has such a method).
> Currently, adding new bits to an existing bitmap is implemented in 
> ArrayDataConcatenator; the code adds bit by bit in a loop, converting each to 
> a boolean value:
> for (int i = 0; i < length; i++)
> {
> builder.Append(span.IsEmpty || BitUtility.GetBit(span, i));
> }
> Initial problem was described in this email: 
> https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb
> PR: https://github.com/apache/arrow/pull/13810



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation

2022-08-31 Thread Alexey Smirnov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598339#comment-17598339
 ] 

Alexey Smirnov edited comment on ARROW-17330 at 8/31/22 11:31 AM:
--

Pull request is ready. Issue requires verification and approval


was (Author: JIRAUSER293436):
Pull request is ready. Issue requires verification approval

> [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array 
> concatenation
> ---
>
> Key: ARROW-17330
> URL: https://issues.apache.org/jira/browse/ARROW-17330
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Affects Versions: 8.0.0
>Reporter: Alexey Smirnov
>Assignee: Alexey Smirnov
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Extend ArrowBuffer.BitmapBuilder with an Append method overload that takes a 
> ReadOnlySpan parameter. 
> This allows adding validity bits to the builder more efficiently (especially 
> for cases where the initial validity bits are added to a newly created, empty 
> builder). Moreover, it makes the BitmapBuilder API more consistent (for 
> example, ArrowBuffer.Builder already has such a method).
> Currently, adding new bits to an existing bitmap is implemented in 
> ArrayDataConcatenator; the code adds bit by bit in a loop, converting each to 
> a boolean value:
> for (int i = 0; i < length; i++)
> {
> builder.Append(span.IsEmpty || BitUtility.GetBit(span, i));
> }
> Initial problem was described in this email: 
> https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb
> PR: https://github.com/apache/arrow/pull/13810



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation

2022-08-31 Thread Alexey Smirnov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Smirnov reopened ARROW-17330:


Pull request is ready. The issue requires verification and approval.

> [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array 
> concatenation
> ---
>
> Key: ARROW-17330
> URL: https://issues.apache.org/jira/browse/ARROW-17330
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Affects Versions: 8.0.0
>Reporter: Alexey Smirnov
>Assignee: Alexey Smirnov
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Extend ArrowBuffer.BitmapBuilder with an Append method overload that takes a 
> ReadOnlySpan parameter. 
> This allows adding validity bits to the builder more efficiently (especially 
> for cases where the initial validity bits are added to a newly created, empty 
> builder). Moreover, it makes the BitmapBuilder API more consistent (for 
> example, ArrowBuffer.Builder already has such a method).
> Currently, adding new bits to an existing bitmap is implemented in 
> ArrayDataConcatenator; the code adds bit by bit in a loop, converting each to 
> a boolean value:
> for (int i = 0; i < length; i++)
> {
> builder.Append(span.IsEmpty || BitUtility.GetBit(span, i));
> }
> Initial problem was described in this email: 
> https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb
> PR: https://github.com/apache/arrow/pull/13810



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation

2022-08-31 Thread Alexey Smirnov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Smirnov resolved ARROW-17330.

Resolution: Implemented

> [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array 
> concatenation
> ---
>
> Key: ARROW-17330
> URL: https://issues.apache.org/jira/browse/ARROW-17330
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Affects Versions: 8.0.0
>Reporter: Alexey Smirnov
>Assignee: Alexey Smirnov
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Extend ArrowBuffer.BitmapBuilder with an Append method overload that takes a 
> ReadOnlySpan parameter. 
> This allows adding validity bits to the builder more efficiently (especially 
> for cases where the initial validity bits are added to a newly created, empty 
> builder). Moreover, it makes the BitmapBuilder API more consistent (for 
> example, ArrowBuffer.Builder already has such a method).
> Currently, adding new bits to an existing bitmap is implemented in 
> ArrayDataConcatenator; the code adds bit by bit in a loop, converting each to 
> a boolean value:
> for (int i = 0; i < length; i++)
> {
> builder.Append(span.IsEmpty || BitUtility.GetBit(span, i));
> }
> Initial problem was described in this email: 
> https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb
> PR: https://github.com/apache/arrow/pull/13810



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation

2022-08-31 Thread Alexey Smirnov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Smirnov reassigned ARROW-17330:
--

Assignee: Alexey Smirnov

> [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array 
> concatenation
> ---
>
> Key: ARROW-17330
> URL: https://issues.apache.org/jira/browse/ARROW-17330
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Affects Versions: 8.0.0
>Reporter: Alexey Smirnov
>Assignee: Alexey Smirnov
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Extend ArrowBuffer.BitmapBuilder with an Append method overload that takes a 
> ReadOnlySpan parameter. 
> This allows adding validity bits to the builder more efficiently (especially 
> for cases where the initial validity bits are added to a newly created, empty 
> builder). Moreover, it makes the BitmapBuilder API more consistent (for 
> example, ArrowBuffer.Builder already has such a method).
> Currently, adding new bits to an existing bitmap is implemented in 
> ArrayDataConcatenator; the code adds bit by bit in a loop, converting each to 
> a boolean value:
> for (int i = 0; i < length; i++)
> {
> builder.Append(span.IsEmpty || BitUtility.GetBit(span, i));
> }
> Initial problem was described in this email: 
> https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb
> PR: https://github.com/apache/arrow/pull/13810



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17578) [CI] Nightly test-r-gcc-12 fails to build

2022-08-31 Thread Jira
Raúl Cumplido created ARROW-17578:
-

 Summary: [CI] Nightly test-r-gcc-12 fails to build
 Key: ARROW-17578
 URL: https://issues.apache.org/jira/browse/ARROW-17578
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Raúl Cumplido


[test-r-gcc-12|https://github.com/ursacomputing/crossbow/runs/8104457062?check_suite_focus=true]
 has been failing to build since the 18th of August. The current error log is:
{code:java}
 #4 ERROR: executor failed running [/bin/bash -o pipefail -c apt-get update -y 
&&     apt-get install -y         dirmngr         apt-transport-https         
software-properties-common &&     wget -qO- 
https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc |         tee 
-a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc &&     add-apt-repository 'deb 
https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release -cs)'-cran40/' &&   
  apt-get install -y         r-base=${r}*         r-recommended=${r}*         
libxml2-dev         libgit2-dev         libssl-dev         clang         
clang-format         clang-tidy         texlive-latex-base         locales      
   python3         python3-pip         python3-dev &&     locale-gen 
en_US.UTF-8 &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*]: exit 
code: 100
--
 > [ 2/17] RUN apt-get update -y &&     apt-get install -y         dirmngr      
   apt-transport-https         software-properties-common &&     wget -qO- 
https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc |         tee 
-a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc &&     add-apt-repository 'deb 
https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release -cs)'-cran40/' &&   
  apt-get install -y         r-base=4.2*         r-recommended=4.2*         
libxml2-dev         libgit2-dev         libssl-dev         clang         
clang-format         clang-tidy         texlive-latex-base         locales      
   python3         python3-pip         python3-dev &&     locale-gen 
en_US.UTF-8 &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*:
--
executor failed running [/bin/bash -o pipefail -c apt-get update -y &&     
apt-get install -y         dirmngr         apt-transport-https         
software-properties-common &&     wget -qO- 
https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc |         tee 
-a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc &&     add-apt-repository 'deb 
https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release -cs)'-cran40/' &&   
  apt-get install -y         r-base=${r}*         r-recommended=${r}*         
libxml2-dev         libgit2-dev         libssl-dev         clang         
clang-format         clang-tidy         texlive-latex-base         locales      
   python3         python3-pip         python3-dev &&     locale-gen 
en_US.UTF-8 &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*]: exit 
code: 100
Service 'ubuntu-r-only-r' failed to build : Build failed {code}
I can't reproduce this locally (I get a different error), but could it be that 
we are pulling an outdated upstream Docker image?
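
One way to rule out a stale base image locally might be to force a re-pull of the upstream layers before rebuilding (service name taken from the error log above; this is only a sketch, not the documented CI invocation):

{code:bash}
# Rebuild the failing service while forcing docker-compose to re-pull the
# upstream base image referenced by its Dockerfile.
docker-compose build --pull ubuntu-r-only-r
{code}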

I don't think the changes introduced around the first build failure are relevant 
to this failure, but here they are:
https://github.com/apache/arrow/compare/6e8f0e4d327180375dda53287a5a600ba139ce3d...a1c3d57af514d4a84e753ff51df8e563135ee55e

cc [~kou] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17483) Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'

2022-08-31 Thread Miles Granger (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger reassigned ARROW-17483:
-

Assignee: Miles Granger

> Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'
> -
>
> Key: ARROW-17483
> URL: https://issues.apache.org/jira/browse/ARROW-17483
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Patrik Kjærran
>Assignee: Miles Granger
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the _filters_ argument supports {{List[Tuple]}}, 
> {{List[List[Tuple]]}}, or None as its input types. I was surprised to see 
> that Expressions were not supported, considering that filters are converted 
> to expressions internally when using use_legacy_dataset=False.
> The check on 
> [L150-L153|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L150-L153]
>  short-circuits and succeeds when encountering an expression, but later fails 
> on 
> [L2343|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L2343]
>  as the expression is evaluated as part of a boolean expression. 
> I think declaring filters using pa.compute.Expression is more Pythonic and 
> less error-prone, and ill-formed filters will be detected much earlier than 
> when using the list-of-tuple-of-string equivalents.
> *Example:*
> {code:java}
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
> # Creating a dummy table
> table = pa.table({
>     'year': [2020, 2022, 2021, 2022, 2019, 2021],
>     'n_legs': [2, 2, 4, 4, 5, 100],
>     'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", 
> "Centipede"]
> })
> pq.write_to_dataset(table, root_path='dataset_name_2', 
> partition_cols=['year'])
> # Reading using 'pyarrow.compute.Expression'
> pq.read_table('dataset_name_2', columns=["n_legs", "animal"], 
> filters=pc.field("n_legs") < 4)
> # Reading using List[Tuple]
> pq.read_table('dataset_name_2', columns=["n_legs", "animal"], 
> filters=[('n_legs', '<', 4)])  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17483) Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17483:
---
Labels: pull-request-available  (was: )

> Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'
> -
>
> Key: ARROW-17483
> URL: https://issues.apache.org/jira/browse/ARROW-17483
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Patrik Kjærran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the _filters_ argument supports {{List[Tuple]}}, 
> {{List[List[Tuple]]}}, or None as its input types. I was surprised to see 
> that Expressions were not supported, considering that filters are converted 
> to expressions internally when using use_legacy_dataset=False.
> The check on 
> [L150-L153|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L150-L153]
>  short-circuits and succeeds when encountering an expression, but later fails 
> on 
> [L2343|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L2343]
>  as the expression is evaluated as part of a boolean expression. 
> I think declaring filters using pa.compute.Expression is more Pythonic and 
> less error-prone, and ill-formed filters will be detected much earlier than 
> when using the list-of-tuple-of-string equivalents.
> *Example:*
> {code:java}
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
> # Creating a dummy table
> table = pa.table({
>     'year': [2020, 2022, 2021, 2022, 2019, 2021],
>     'n_legs': [2, 2, 4, 4, 5, 100],
>     'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", 
> "Centipede"]
> })
> pq.write_to_dataset(table, root_path='dataset_name_2', 
> partition_cols=['year'])
> # Reading using 'pyarrow.compute.Expression'
> pq.read_table('dataset_name_2', columns=["n_legs", "animal"], 
> filters=pc.field("n_legs") < 4)
> # Reading using List[Tuple]
> pq.read_table('dataset_name_2', columns=["n_legs", "animal"], 
> filters=[('n_legs', '<', 4)])  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17571) [Benchmarks] Default build for PyArrow seems to be debug

2022-08-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17571:
---
Labels: pull-request-available  (was: )

> [Benchmarks] Default build for PyArrow seems to be debug
> 
>
> Key: ARROW-17571
> URL: https://issues.apache.org/jira/browse/ARROW-17571
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Benchmarking, Python
>Reporter: Alenka Frim
>Assignee: Alenka Frim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After a benchmark regression was identified in the [Python refactoring 
> PR|https://github.com/apache/arrow/pull/13311], we traced the cause to the 
> build script used for benchmarks. In _dev/conbench_envs/hooks.sh_, the script 
> used to build PyArrow is _ci/scripts/python_build.sh_, where the default 
> PyArrow build type is set to *debug* (assuming _CMAKE_BUILD_TYPE_ isn't 
> defined).
> See:
> [https://github.com/apache/arrow/blob/74dae618ed8d6b492bf3b88e3b9b7dfd4c21e8d8/dev/conbench_envs/hooks.sh#L60-L62]
> [https://github.com/apache/arrow/blob/93b63e8f3b4880927ccbd5522c967df79e926cda/ci/scripts/python_build.sh#L55]
>  
> I think we need to change the build type to release in 
> _dev/conbench_envs/hooks.sh_ (_build_arrow_python()_) or maybe better to set 
> the variable _CMAKE_BUILD_TYPE_ to release in 
> _dev/conbench_envs/benchmarks.env_.
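
A minimal sketch of the proposed fix (where exactly the variable should be set, _benchmarks.env_ or _hooks.sh_, is still open):

{code:bash}
# Ensure the benchmark environment builds PyArrow in release mode instead of
# falling back to the debug default used by ci/scripts/python_build.sh.
export CMAKE_BUILD_TYPE=release
{code}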



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17577) [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow

2022-08-31 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-17577:
---

 Summary: [C++][Python] CMake cannot find Arrow/Arrow Python when 
building PyArrow
 Key: ARROW-17577
 URL: https://issues.apache.org/jira/browse/ARROW-17577
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Alenka Frim


When building on master yesterday, the PyArrow build worked fine. Today there is 
an issue with CMake being unable to find the required packages. See:

 
{code:java}
-- Finished CMake build and install for PyArrow C++
creating /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9
-- Running cmake for PyArrow
cmake -DPYTHON_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python 
-DPython3_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python 
-DPYARROW_CPP_HOME=/Users/alenkafrim/repos/arrow/python/build/dist "" 
-DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_SUBSTRAIT=off 
-DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=off 
-DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on 
-DPYARROW_BUILD_PARQUET_ENCRYPTION=off -DPYARROW_BUILD_PLASMA=off 
-DPYARROW_BUILD_GCS=off -DPYARROW_BUILD_S3=on -DPYARROW_BUILD_HDFS=off 
-DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
-DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
-DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
-DCMAKE_BUILD_TYPE=release /Users/alenkafrim/repos/arrow/python
CMake Warning:
  Ignoring empty string ("") provided on the command line.




-- The C compiler identification is AppleClang 13.1.6.13160021
-- The CXX compiler identification is AppleClang 13.1.6.13160021
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- System processor: arm64
-- Performing Test CXX_SUPPORTS_ARMV8_ARCH
-- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Success
-- Arrow build warning level: PRODUCTION
-- Configured for RELEASE build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Generator: Unix Makefiles
-- Build output directory: 
/Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9/release
-- Found Python3: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python (found 
version "3.9.13") found components: Interpreter Development.Module NumPy 
-- Found Python3Alt: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python  
CMake Error at 
/opt/homebrew/Cellar/cmake/3.24.1/share/cmake/Modules/CMakeFindDependencyMacro.cmake:47
 (find_package):
  By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Arrow", but
  CMake did not find one.


  Could not find a package configuration file provided by "Arrow" with any of
  the following names:


    ArrowConfig.cmake
    arrow-config.cmake


  Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
  "Arrow_DIR" to a directory containing one of the above files.  If "Arrow"
  provides a separate development package or SDK, be sure it has been
  installed.
Call Stack (most recent call first):
  build/dist/lib/cmake/ArrowPython/ArrowPythonConfig.cmake:54 (find_dependency)
  CMakeLists.txt:240 (find_package)

{code}
I did a clean build on the latest master. Am I missing some variables that need 
to be set after [https://github.com/apache/arrow/pull/13892]?

I am calling cmake with these flags:

 
{code:java}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
              -DCMAKE_INSTALL_LIBDIR=lib \
              -DCMAKE_BUILD_TYPE=debug \
              -DARROW_WITH_BZ2=ON \
              -DARROW_WITH_ZLIB=ON \
              -DARROW_WITH_ZSTD=ON \
              -DARROW_WITH_LZ4=ON \
              -DARROW_WITH_SNAPPY=ON \
              -DARROW_WITH_BROTLI=ON \
              -DARROW_PLASMA=OFF \
              -DARROW_PARQUET=ON \
              -DPARQUET_REQUIRE_ENCRYPTION=OFF \
              -DARROW_PYTHON=ON \
              -DARROW_FLIGHT=ON \
              -DARROW_JEMALLOC=OFF \
              -DARROW_S3=ON \
              -DARROW_GCS=OFF \
              -DARROW_BUILD_TESTS=ON \
              -DARROW_DEPENDENCY_SOURCE=AUTO \
              -DARROW_INSTALL_NAME_RPATH=OFF \
              -DARROW_EXTRA_ERROR_CONTEXT=ON \
              -GNinja \
              ..
        popd {code}
and building python with 
{code:java}
python setup.py build_ext --inplace {code}
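
A possible workaround, following the hint in the CMake error above, would be to make sure the prefix containing ArrowConfig.cmake is on CMAKE_PREFIX_PATH before re-running the build (sketch only; $ARROW_HOME is the Arrow C++ install prefix from the flags shown earlier):

{code:bash}
# Point CMake at the prefix that contains ArrowConfig.cmake, then rebuild PyArrow.
export CMAKE_PREFIX_PATH=$ARROW_HOME${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}
python setup.py build_ext --inplace
{code}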
 

 

cc [~kou] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17543) [R] %in% on an empty vector c() fails

2022-08-31 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane resolved ARROW-17543.
--
Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13990
[https://github.com/apache/arrow/pull/13990]

> [R] %in% on an empty vector c() fails
> -
>
> Key: ARROW-17543
> URL: https://issues.apache.org/jira/browse/ARROW-17543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 9.0.0
>Reporter: Egill Axfjord Fridgeirsson
>Assignee: Egill Axfjord Fridgeirsson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When using %in% on empty vectors, I'm getting an error:
> "Error: Cannot infer type from vector"
> I'd expect this to work the same as in base R, where you can use %in% on 
> empty vectors.
> The arrow::is_in compute function does accept nulls as the value_set. If I 
> manually create an empty array of type NULL, it does work as expected.
> Reprex:
> {code:java}
> library(dplyr)
> library(arrow)
> options(arrow.debug=T)
> #base R
> a <- c(1,2,3)
> b <- c() # NULL
> a %in% b
> #> [1] FALSE FALSE FALSE
> # arrow arrays
> arrowArray <- arrow::Array$create(c(1,2,3))
> arrow::is_in(arrowArray, c())
> #> Error: Cannot infer type from vector
> # define type of c() manually
> arrow::is_in(arrowArray, arrow::Array$create(c(), type=arrow::null()))
> #> Array
> #> 
> #> [
> #>   false,
> #>   false,
> #>   false
> #> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17576) conda r-arrow Linux package has

2022-08-31 Thread Hans-Martin von Gaudecker (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans-Martin von Gaudecker updated ARROW-17576:
--
Description: 
I need to read parquet files in R using a conda environment. This works great 
on Windows, but on Linux, r-arrow comes without some core features. If this is 
expected, it would be great to flag it in the docs; at least for me, reading 
through 
[https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda]
 gives the impression that I need not worry about features (though pinning 
r-arrow to 8.0.1 gives a complete version lacking only lzo support).

After creating an environment based on the attached specification:

{{(test-r-arrow) x@x/test-r-arrow$ R}}
{{R version 4.1.3 (2022-03-10) -- "One Push-Up"}}
{{Copyright (C) 2022 The R Foundation for Statistical Computing}}
{{Platform: x86_64-conda-linux-gnu (64-bit)}}

 

{{R is free software and comes with ABSOLUTELY NO WARRANTY.}}
{{You are welcome to redistribute it under certain conditions.}}
{{Type 'license()' or 'licence()' for distribution details.}}

 

{{Natural language support but running in an English locale}}

 

{{R is a collaborative project with many contributors.}}
{{Type 'contributors()' for more information and}}
{{'citation()' on how to cite R or R packages in publications.}}

 

{{Type 'demo()' for some demos, 'help()' for on-line help, or}}
{{'help.start()' for an HTML browser interface to help.}}
{{Type 'q()' to quit R.}}

 

{{> library(arrow)}}
{{Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.}}

{{Attaching package: ‘arrow’}}

{{The following object is masked from ‘package:utils’:}}
{{    timestamp}}

 

{{> arrow_info()}}
{{Arrow package version: 9.0.0}}

{{Capabilities:}}
{{               }}
{{dataset   FALSE}}
{{substrait FALSE}}
{{parquet   FALSE}}
{{json      FALSE}}
{{s3        FALSE}}
{{gcs       FALSE}}
{{utf8proc   TRUE}}
{{re2        TRUE}}
{{snappy     TRUE}}
{{gzip       TRUE}}
{{brotli     TRUE}}
{{zstd       TRUE}}
{{lz4        TRUE}}
{{lz4_frame  TRUE}}
{{lzo       FALSE}}
{{bz2        TRUE}}
{{jemalloc   TRUE}}
{{mimalloc   TRUE}}

{{To reinstall with more optional capabilities enabled, see}}
{{   [https://arrow.apache.org/docs/r/articles/install.html]}}

 

{{Memory:}}
{{                  }}
{{Allocator jemalloc}}
{{Current    0 bytes}}
{{Max        0 bytes}}

{{Runtime:}}
{{                        }}
{{SIMD Level          avx2}}
{{Detected SIMD Level avx2}}

{{Build:}}
{{                                                             }}
{{C++ Library Version                                     9.0.0}}
{{C++ Compiler                                              GNU}}
{{C++ Compiler Version                                   10.4.0}}
{{Git ID               13127e16b858dda3b8299a1e435c3c0ba5934fdc}}

 

Creating the same environment on Windows produces:

 

{{> arrow_info()}}
{{Arrow package version: 9.0.0}}

{{Capabilities:}}

{{dataset    TRUE}}
{{substrait FALSE}}
{{parquet    TRUE}}
{{json       TRUE}}
{{s3         TRUE}}
{{gcs       FALSE}}
{{utf8proc   TRUE}}
{{re2        TRUE}}
{{snappy     TRUE}}
{{gzip       TRUE}}
{{brotli     TRUE}}
{{zstd       TRUE}}
{{lz4        TRUE}}
{{lz4_frame  TRUE}}
{{lzo       FALSE}}
{{bz2        TRUE}}
{{jemalloc  FALSE}}
{{mimalloc   TRUE}}

{{Arrow options():}}

{{arrow.use_threads FALSE}}

{{Memory:}}
{{Allocator mimalloc}}
{{Current    0 bytes}}
{{Max        0 bytes}}

{{Runtime:}}

{{SIMD Level          avx2}}
{{Detected SIMD Level avx2}}

{{Build:}}

{{C++ Library Version          9.0.0}}
{{C++ Compiler                  MSVC}}
{{C++ Compiler Version 19.16.27048.0}}
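
As a possible interim workaround based on the note above about 8.0.1, the environment could be pinned when it is created (sketch only; the environment name and R pin are illustrative):

{code:bash}
# Pin r-arrow to 8.0.1, which the report says yields a Linux build lacking only
# lzo support, instead of the feature-reduced 9.0.0 conda-forge package.
conda create -n test-r-arrow -c conda-forge r-base=4.1.3 r-arrow=8.0.1
{code}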

 

  was:
I need to read parquet files in R using a conda environment. Works great on 
Windows, but on Linux, r-arrow comes without some core features. If this is 
expected, it would be great to flag it in the docs, at least for me, reading 
through 
[https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda]
 gives the impression that I need not worry about features (though pinning 
r-arrow to 8.0.1 gives a complete version lacking only lzo-support).

After creating an environment based on the attached specification:

{{(test-r-arrow) x@x/test-r-arrow$ R}}
{{R version 4.1.3 (2022-03-10) -- "One Push-Up"}}
{{Copyright (C) 2022 The R Foundation for Statistical Computing}}
{{Platform: x86_64-conda-linux-gnu (64-bit)}}

 

{{R is free software and comes with ABSOLUTELY NO WARRANTY.}}
{{You are welcome to redistribute it under certain conditions.}}
{{Type 'license()' or 'licence()' for distribution details.}}

 

{{Natural language support but running in an English locale}}

 

{{R is a collaborative project with many contributors.}}
{{Type 'contributors()' for 

[jira] [Updated] (ARROW-17576) conda r-arrow Linux package has

2022-08-31 Thread Hans-Martin von Gaudecker (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans-Martin von Gaudecker updated ARROW-17576:
--
Description: 
I need to read parquet files in R using a conda environment. This works great 
on Windows, but on Linux, r-arrow comes without some core features. If this is 
expected, it would be great to flag it in the docs; at least for me, reading 
through 
[https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda]
 gives the impression that I need not worry about features (though pinning 
r-arrow to 8.0.1 gives a complete version lacking only lzo support).

After creating an environment based on the attached specification:

{{(test-r-arrow) x@x/test-r-arrow$ R}}
{{R version 4.1.3 (2022-03-10) -- "One Push-Up"}}
{{Copyright (C) 2022 The R Foundation for Statistical Computing}}
{{Platform: x86_64-conda-linux-gnu (64-bit)}}

 

{{R is free software and comes with ABSOLUTELY NO WARRANTY.}}
{{You are welcome to redistribute it under certain conditions.}}
{{Type 'license()' or 'licence()' for distribution details.}}

 

{{Natural language support but running in an English locale}}

 

{{R is a collaborative project with many contributors.}}
{{Type 'contributors()' for more information and}}
{{'citation()' on how to cite R or R packages in publications.}}

 

{{Type 'demo()' for some demos, 'help()' for on-line help, or}}
{{'help.start()' for an HTML browser interface to help.}}
{{Type 'q()' to quit R.}}

 

{{> library(arrow)}}
{{Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.}}

{{Attaching package: ‘arrow’}}

{{The following object is masked from ‘package:utils’:}}
{{    timestamp}}

 

{{> arrow_info()}}
{{Arrow package version: 9.0.0}}

{{Capabilities:}}
{{               }}
{{dataset   FALSE}}
{{substrait FALSE}}
{{parquet   FALSE}}
{{json      FALSE}}
{{s3        FALSE}}
{{gcs       FALSE}}
{{utf8proc   TRUE}}
{{re2        TRUE}}
{{snappy     TRUE}}
{{gzip       TRUE}}
{{brotli     TRUE}}
{{zstd       TRUE}}
{{lz4        TRUE}}
{{lz4_frame  TRUE}}
{{lzo       FALSE}}
{{bz2        TRUE}}
{{jemalloc   TRUE}}
{{mimalloc   TRUE}}

{{To reinstall with more optional capabilities enabled, see}}
{{   [https://arrow.apache.org/docs/r/articles/install.html]}}

 

{{Memory:}}
{{                  }}
{{Allocator jemalloc}}
{{Current    0 bytes}}
{{Max        0 bytes}}

{{Runtime:}}
{{                        }}
{{SIMD Level          avx2}}
{{Detected SIMD Level avx2}}

{{Build:}}
{{                                                             }}
{{C++ Library Version                                     9.0.0}}
{{C++ Compiler                                              GNU}}
{{C++ Compiler Version                                   10.4.0}}
{{Git ID               13127e16b858dda3b8299a1e435c3c0ba5934fdc}}

 

Creating the same environment on Windows produces:

 

{{> arrow_info()}}
{{Arrow package version: 9.0.0}}

{{Capabilities:}}

{{dataset    TRUE}}
{{substrait FALSE}}
{{parquet    TRUE}}
{{json       TRUE}}
{{s3         TRUE}}
{{gcs       FALSE}}
{{utf8proc   TRUE}}
{{re2        TRUE}}
{{snappy     TRUE}}
{{gzip       TRUE}}
{{brotli     TRUE}}
{{zstd       TRUE}}
{{lz4        TRUE}}
{{lz4_frame  TRUE}}
{{lzo       FALSE}}
{{bz2        TRUE}}
{{jemalloc  FALSE}}
{{mimalloc   TRUE}}

{{Arrow options():}}

{{arrow.use_threads FALSE}}

{{Memory:}}
{{Allocator mimalloc}}
{{Current    0 bytes}}
{{Max        0 bytes}}

{{Runtime:}}

{{SIMD Level          avx2}}
{{Detected SIMD Level avx2}}

{{Build:}}

{{C++ Library Version          9.0.0}}
{{C++ Compiler                  MSVC}}
{{C++ Compiler Version 19.16.27048.0}}

 

  was:
I need to read parquet files in R using a conda environment. Works great on 
Windows, but on Linux, r-arrow comes without some core features. If this is 
expected, it would be great to flag it in the docs, at least for me, reading 
through 
[https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda]
 gives the impression that I need not worry about features (though pinning 
r-arrow to 8.0.1 gives a complete version lacking only lzo-support).

After creating an environment based on the attached specification:

{{(test-r-arrow) x@x/test-r-arrow$ R}}
{{R version 4.1.3 (2022-03-10) -- "One Push-Up"}}
{{Copyright (C) 2022 The R Foundation for Statistical Computing}}
{{Platform: x86_64-conda-linux-gnu (64-bit)}}

 

{{R is free software and comes with ABSOLUTELY NO WARRANTY.}}
{{You are welcome to redistribute it under certain conditions.}}
{{Type 'license()' or 'licence()' for distribution details.}}

 

{{Natural language support but running in an English locale}}

 

{{R is a collaborative project with many contributors.}}
{{Type 'contributors

[jira] [Created] (ARROW-17576) conda r-arrow Linux package has

2022-08-31 Thread Hans-Martin von Gaudecker (Jira)
Hans-Martin von Gaudecker created ARROW-17576:
-

 Summary: conda r-arrow Linux package has 
 Key: ARROW-17576
 URL: https://issues.apache.org/jira/browse/ARROW-17576
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 9.0.0
 Environment: Ubuntu 20.04

conda 4.13.0   py39hf3d152e_1conda-forge

Environment created based on attached script:

# NameVersion   Build  Channel
_libgcc_mutex 0.1 conda_forgeconda-forge
_openmp_mutex 4.5   2_gnuconda-forge
_r-mutex  1.0.1   anacondar_1conda-forge
abseil-cpp20211102.0   h27087fc_1conda-forge
arrow-cpp 9.0.0   py310h893e394_0_cpuconda-forge
aws-c-cal 0.5.11   h95a6274_0conda-forge
aws-c-common  0.6.2h7f98852_0conda-forge
aws-c-event-stream0.2.7   h3541f99_13conda-forge
aws-c-io  0.10.5   hfb6a706_0conda-forge
aws-checksums 0.1.11   ha31a3da_7conda-forge
aws-sdk-cpp   1.8.186  hb4091e7_3conda-forge
binutils_impl_linux-642.36.1   h193b22a_2conda-forge
bwidget   1.9.14   ha770c72_1conda-forge
bzip2 1.0.8h7f98852_4conda-forge
c-ares1.18.1   h7f98852_0conda-forge
ca-certificates   2022.6.15ha878542_0conda-forge
cairo 1.16.0ha61ee94_1013conda-forge
curl  7.83.1   h7bff187_0conda-forge
expat 2.4.8h27087fc_0conda-forge
font-ttf-dejavu-sans-mono 2.37 hab24e00_0conda-forge
font-ttf-inconsolata  3.000h77eed37_0conda-forge
font-ttf-source-code-pro  2.038h77eed37_0conda-forge
font-ttf-ubuntu   0.83 hab24e00_0conda-forge
fontconfig2.14.0   h8e229c2_0conda-forge
fonts-conda-ecosystem 1 0conda-forge
fonts-conda-forge 1 0conda-forge
freetype  2.12.1   hca18f0e_0conda-forge
fribidi   1.0.10   h36c2ea0_0conda-forge
gcc_impl_linux-64 12.1.0  hea43390_16conda-forge
gettext   0.19.8.1  h73d1719_1008conda-forge
gflags2.2.2 he1b5a44_1004conda-forge
gfortran_impl_linux-6412.1.0  h1db8e46_16conda-forge
glog  0.6.0h6f12383_0conda-forge
graphite2 1.3.13h58526e2_1001conda-forge
grpc-cpp  1.46.3   hbd84cd8_3conda-forge
gsl   2.7  he838d99_0conda-forge
gxx_impl_linux-64 12.1.0  hea43390_16conda-forge
harfbuzz  5.1.0hf9f4e7c_0conda-forge
icu   70.1 h27087fc_0conda-forge
jpeg  9e   h166bdaf_2conda-forge
kernel-headers_linux-64   2.6.32  he073ed8_15conda-forge
keyutils  1.6.1h166bdaf_0conda-forge
krb5  1.19.3   h3790be6_0conda-forge
ld_impl_linux-64  2.36.1   hea4e1c9_2conda-forge
lerc  4.0.0h27087fc_0conda-forge
libblas   3.9.0   16_linux64_openblasconda-forge
libbrotlicommon   1.0.9h166bdaf_7conda-forge
libbrotlidec  1.0.9h166bdaf_7conda-forge
libbrotlienc  1.0.9h166bdaf_7conda-forge
libcblas  3.9.0   16_linux64_openblasconda-forge
libcrc32c 1.1.2h9c3ff4c_0conda-forge
libcurl   7.83.1   h7bff187_0conda-forge
libdeflate1.13 h166bdaf_0conda-forge
libedit   3.1.20191231 he28a2e2_2conda-forge
libev 4.33 h516909a_1conda-forge
libevent  2.1.10   h9b69904_4conda-forge
libffi3.4.2h7f98852_5conda-forge
libgcc-devel_linux-64 12.1.0  h1ec3361_16conda-forge
libgcc-ng 12.1.0  h8d9b700_16conda-forge
libgfortran-ng12.1.0  h69a702a_16conda-forge
libgfortran

[jira] [Commented] (ARROW-17568) [FlightRPC][Integration] Ensure all RPC methods are covered by integration testing

2022-08-31 Thread Kun Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598210#comment-17598210
 ] 

Kun Liu commented on ARROW-17568:
-

Hi [~lidavidm],

Do we already have a framework for testing the RPC methods and their 
compatibility in the integration tests?

> [FlightRPC][Integration] Ensure all RPC methods are covered by integration 
> testing
> --
>
> Key: ARROW-17568
> URL: https://issues.apache.org/jira/browse/ARROW-17568
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Go, Integration, Java
>Reporter: David Li
>Priority: Major
>
> This would help catch issues like https://github.com/apache/arrow/issues/13853



--
This message was sent by Atlassian Jira
(v8.20.10#820010)