[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598742#comment-17598742 ]

Micah Kornfield commented on ARROW-17459:
-----------------------------------------

You would have to follow this up the stack from the previous comments. Without seeing the stack trace it is a bit hard to give guidance, but I'd guess there are a few places in the linked code that always expected BinaryArray/BinaryBuilder and might down_cast; these would need to be adjusted accordingly.

> [C++] Support nested data conversions for chunked array
> -------------------------------------------------------
>
>                 Key: ARROW-17459
>                 URL: https://issues.apache.org/jira/browse/ARROW-17459
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Arthur Passos
>            Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not
> implemented for chunked array outputs". It fails on
> [ChunksToSingle](https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95)
> Data schema is:
> {code:java}
> optional group fields_map (MAP) = 217 {
>   repeated group key_value {
>     required binary key (STRING) = 218;
>     optional binary value (STRING) = 219;
>   }
> }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am
> very new to parquet (as in started reading about it yesterday).
>
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (ARROW-17590) Lower memory usage with filters
[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin updated ARROW-17590:
------------------------
    Description:

Hi,

When I read a parquet file (about 23MB with 250K rows and 600 object/string columns with lots of None) with a filter on a not-null column for a small number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). The result table and dataframe have only a few rows (1 row 20KB, 500 rows 20MB). It looks like it scans/loads many rows from the parquet file. Not only is the footprint or watermark of memory usage high, but it also seems not to release the memory in time (such as after GC in Python, though it may get reused for a subsequent read).

When reading the same parquet file without filtering, the memory usage is about the same at 900MB. It goes up to 2.3GB after to_pandas; df.info(memory_usage='deep') shows 4.3GB, maybe double counting something.

The filtered column is not a partition key, which functionally works to get the correct rows. But the memory usage is quite high even when the parquet file is not really large, partitioned or not. There were some references similar to this issue, for example: https://github.com/apache/arrow/issues/7338

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)
pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()

The memory usage is small when constructing the scanner but then goes up after the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with read filtering? If not supported yet, can it be fixed/improved with priority? This is a blocking issue for us.

I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched somehow in the PyArrow Python code, or needs changes to and a rebuild of the Arrow C++ code. Thanks!

> Lower memory usage with filters
> -------------------------------
>
>                 Key: ARROW-17590
>                 URL: https://issues.apache.org/jira/browse/ARROW-17590
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yin
>            Priority: Major
[jira] [Closed] (ARROW-17591) Arrow is NOT working with Java 17
[ https://issues.apache.org/jira/browse/ARROW-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

evan closed ARROW-17591.
------------------------
    Resolution: Not A Problem

> Arrow is NOT working with Java 17
> ---------------------------------
>
>                 Key: ARROW-17591
>                 URL: https://issues.apache.org/jira/browse/ARROW-17591
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 9.0.0
>            Reporter: evan
>            Priority: Major
>
> h2. Running environment:
> * OS: macOS Monterey
> * Language: Java
> * Java version: 17.0.2
> * Arrow version: 9.0.0
> h2. Issue:
> {code:java}
> java.lang.NoClassDefFoundError: Could not initialize class org.apache.arrow.memory.util.MemoryUtil
>     at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175)
>     at org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216)
>     at org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210)
>     at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342)
>     at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309)
>     at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274)
> {code}
> h2. Things tried:
> * adding --add-opens java.base/java.nio=ALL-UNNAMED to the JVM args

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (ARROW-17591) Arrow is NOT working with Java 17
[ https://issues.apache.org/jira/browse/ARROW-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] evan updated ARROW-17591: - Description: h2. Running environment: * OS: mac monterey * Language: Java * Java version: 17.0.2 * Arrow version: 9.0.0 h2. Issue: {code:java} java.lang.NoClassDefFoundError: Could not initialize class org.apache.arrow.memory.util.MemoryUtil at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) at org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) at org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) {code} h2. Things have done * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args was: h2. Running environment: * OS: mac monterey * Language: Java * Java version: 17.0.2 * Arrow version: 9.0.0 h2. 
Issue: {code:java} java.lang.NoClassDefFoundError: Could not initialize class org.apache.arrow.memory.util.MemoryUtil at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) at org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) at org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53) {code} h2. Things have done * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args > Arrow is NOT working with Java 17 > - > > Key: ARROW-17591 > URL: https://issues.apache.org/jira/browse/ARROW-17591 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 9.0.0 >Reporter: evan >Priority: Major > > h2. Running environment: > * OS: mac monterey > * Language: Java > * Java version: 17.0.2 > * Arrow version: 9.0.0 > h2. 
Issue: > > {code:java} > java.lang.NoClassDefFoundError: Could not initialize class > org.apache.arrow.memory.util.MemoryUtil > at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) > at > org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) > at > org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) > at > org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) > at > org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) > at > org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) > {code} > h2. Things have done > * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17591) Arrow is NOT working with Java 17
[ https://issues.apache.org/jira/browse/ARROW-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] evan updated ARROW-17591: - Description: h2. Running environment: * OS: mac monterey * Language: Java * Java version: 17.0.2 * Arrow version: 9.0.0 h2. Issue: {code:java} java.lang.NoClassDefFoundError: Could not initialize class org.apache.arrow.memory.util.MemoryUtil at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) at org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) at org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53) {code} h2. 
Things have done * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args was: ## Running environment: * OS: mac monterey * Language: Java * Java version: 17.0.2 * Arrow version: 9.0.0 ## Issue: {code:java} java.lang.NoClassDefFoundError: Could not initialize class org.apache.arrow.memory.util.MemoryUtil at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) at org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) at org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53) {code} ## Things have done to try to fix it: 1. adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args > Arrow is NOT working with Java 17 > - > > Key: ARROW-17591 > URL: https://issues.apache.org/jira/browse/ARROW-17591 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 9.0.0 >Reporter: evan >Priority: Major > > h2. Running environment: > * OS: mac monterey > * Language: Java > * Java version: 17.0.2 > * Arrow version: 9.0.0 > h2. 
Issue: > > {code:java} > java.lang.NoClassDefFoundError: Could not initialize class > org.apache.arrow.memory.util.MemoryUtil > at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) > at > org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) > at > org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) > at > org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) > at > org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) > at > org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) > at > com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65) > at > com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53) > {code} > > h2. Things have done > * adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17591) Arrow is NOT working with Java 17
evan created ARROW-17591: Summary: Arrow is NOT working with Java 17 Key: ARROW-17591 URL: https://issues.apache.org/jira/browse/ARROW-17591 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 9.0.0 Reporter: evan ## Running environment: * OS: mac monterey * Language: Java * Java version: 17.0.2 * Arrow version: 9.0.0 ## Issue: {code:java} java.lang.NoClassDefFoundError: Could not initialize class org.apache.arrow.memory.util.MemoryUtil at org.apache.arrow.memory.ArrowBuf.setZero(ArrowBuf.java:1175) at org.apache.arrow.vector.BaseFixedWidthVector.initValidityBuffer(BaseFixedWidthVector.java:216) at org.apache.arrow.vector.BaseFixedWidthVector.zeroVector(BaseFixedWidthVector.java:210) at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:342) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:309) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:274) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:65) at com.thingworks.jarvis.persistent.memory.data.array.arrow.VarTextArrayDataStream.(VarTextArrayDataStream.java:53) {code} ## Things have done to try to fix it: 1. adding --add-opens java.base/java.nio=ALL-UNNAMED to jvm args -- This message was sent by Atlassian Jira (v8.20.10#820010)
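For context on the workaround the reporter already tried: the `--add-opens` flag only helps if it reaches the JVM that actually executes the Arrow code. A common failure mode (an assumption here, not confirmed by the report) is that the flag is set on a launcher while a forked JVM, such as a Maven Surefire/Failsafe or Gradle test worker, never sees it. A hedged sketch:

```shell
# JAVA_TOOL_OPTIONS is read by every HotSpot JVM started afterwards,
# including forked test JVMs, so the flag cannot be silently dropped:
export JAVA_TOOL_OPTIONS="--add-opens=java.base/java.nio=ALL-UNNAMED"
# Command-line equivalent for a single run ("app.jar" is a placeholder):
#   java --add-opens=java.base/java.nio=ALL-UNNAMED -jar app.jar
# For Maven Surefire/Failsafe forks the flag belongs in <argLine> instead.
echo "$JAVA_TOOL_OPTIONS"
```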
[jira] [Updated] (ARROW-17590) Lower memory usage with filters
[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin updated ARROW-17590: Description: Hi, When I read a large parquet file with filter for a small number of rows, the memory usage is pretty high. The result table and dataframe have only a few rows. Looks like it scans/loads many rows from the parquet file. Not only the footprint or watermark of memory usage is high, but also it seems not releasing the memory in time (such as after GC in Python, but may get used for subsequent read). The filtered column is not a partition key, which functionally works to get a small number of rows. But the memory usage is high when the parquet (partitioned or not) is large. There were some references related to this issue, for example: [https://github.com/apache/arrow/issues/7338] Related classes/methods in (pyarrow 9.0.0) _ParquetDatasetV2.read self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads) pyarrow._dataset.FileSystemDatase.to_table I played with pyarrow._dataset.Scanner.to_table self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table() The memory usage is small to construct the scanner but then goes up after the to_table call materializes it. Is there some way or workaround to reduce the memory usage with filters? If not supported yet, can it be fixed/improved with priority? This is a blocking issue for us. I don't see if it can be patched in the Python code. Thanks! was: Hi, When I read a large parquet file with filter for a small number of rows, the memory usage is pretty high. The result table and dataframe have only a few rows. Looks like it scans/loads many rows from the parquet file. Not only the footprint or watermark of memory usage is high, but also it seems not releasing the memory in time (such as after GC in Python, but may get used for subsequent read). 
The filtered column is not a partition key, which functionally works to get a small number of rows. But the memory usage is high when the parquet (partitioned or not) is large. There were some references related to this issue, for example: [https://github.com/apache/arrow/issues/7338|https://github.com/apache/arrow/issues/7338.] Related classes/methods in (pyarrow 9.0.0) _ParquetDatasetV2.read self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads) pyarrow._dataset.FileSystemDatase.to_table I played with pyarrow._dataset.Scanner.to_table self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table() The memory usage is small to construct the scanner but then goes up after the to_table call materializes it. Is there some way or workaround to reduce the memory usage with filters? If not supported yet, can it be fixed/improved with priority? This is a blocking issue for us. I don't see if it can be patched in the Python code. Thanks! > Lower memory usage with filters > --- > > Key: ARROW-17590 > URL: https://issues.apache.org/jira/browse/ARROW-17590 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Yin >Priority: Major > > Hi, > When I read a large parquet file with filter for a small number of rows, the > memory usage is pretty high. The result table and dataframe have only a few > rows. Looks like it scans/loads many rows from the parquet file. Not only the > footprint or watermark of memory usage is high, but also it seems not > releasing the memory in time (such as after GC in Python, but may get used > for subsequent read). > The filtered column is not a partition key, which functionally works to get a > small number of rows. But the memory usage is high when the parquet > (partitioned or not) is large. 
There were some references related to this > issue, for example: [https://github.com/apache/arrow/issues/7338] > Related classes/methods in (pyarrow 9.0.0) > _ParquetDatasetV2.read > self._dataset.to_table(columns=columns, filter=self._filter_expression, > use_threads=use_threads) > pyarrow._dataset.FileSystemDatase.to_table > I played with pyarrow._dataset.Scanner.to_table > self._dataset.scanner(columns=columns, > filter=self._filter_expression).to_table() > The memory usage is small to construct the scanner but then goes up after the > to_table call materializes it. > Is there some way or workaround to reduce the memory usage with filters? > If not supported yet, can it be fixed/improved with priority? > This is a blocking issue for us. I don't see if it can be patched in the > Python code. > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17590) Lower memory usage with filters
[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin updated ARROW-17590: Description: Hi, When I read a large parquet file with filter for a small number of rows, the memory usage is pretty high. The result table and dataframe have only a few rows. Looks like it scans/loads many rows from the parquet file. Not only the footprint or watermark of memory usage is high, but also it seems not releasing the memory in time (such as after GC in Python, but may get used for subsequent read). The filtered column is not a partition key, which functionally works to get a small number of rows. But the memory usage is high when the parquet (partitioned or not) is large. There were some references related to this issue, for example: [https://github.com/apache/arrow/issues/7338|https://github.com/apache/arrow/issues/7338.] Related classes/methods in (pyarrow 9.0.0) _ParquetDatasetV2.read self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads) pyarrow._dataset.FileSystemDatase.to_table I played with pyarrow._dataset.Scanner.to_table self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table() The memory usage is small to construct the scanner but then goes up after the to_table call materializes it. Is there some way or workaround to reduce the memory usage with filters? If not supported yet, can it be fixed/improved with priority? This is a blocking issue for us. I don't see if it can be patched in the Python code. Thanks! was: Hi, When I read a large parquet file with filter for a small number of rows, the memory usage is pretty high. The result table and dataframe have only a few rows. Looks like it scans/loads many rows from the parquet file. Not only the footprint or watermark of memory usage is high, but also it seems not releasing the memory in time (such as after GC in Python, but may get used for subsequent read). 
The filtered column is not a partition key, which functionally works to get a small number of rows. But the memory usage is high when the parquet (partitioned or not) is large. There were some references related to this issue, for example: [https://github.com/apache/arrow/issues/7338.] Related classes/methods in (pyarrow 9.0.0) _ParquetDatasetV2.read self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads) pyarrow._dataset.FileSystemDatase.to_table I played with pyarrow._dataset.Scanner.to_table self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table() The memory usage is small to construct the scanner but then goes up after the to_table call materializes it. Is there some way or workaround to reduce the memory usage with filters? If not supported yet, can it be fixed/improved with priority? This is a blocking issue for us. I don't see if it can be patched in the Python code. Thanks! > Lower memory usage with filters > --- > > Key: ARROW-17590 > URL: https://issues.apache.org/jira/browse/ARROW-17590 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Yin >Priority: Major > > Hi, > When I read a large parquet file with filter for a small number of rows, the > memory usage is pretty high. The result table and dataframe have only a few > rows. Looks like it scans/loads many rows from the parquet file. Not only the > footprint or watermark of memory usage is high, but also it seems not > releasing the memory in time (such as after GC in Python, but may get used > for subsequent read). > The filtered column is not a partition key, which functionally works to get a > small number of rows. But the memory usage is high when the parquet > (partitioned or not) is large. There were some references related to this > issue, for example: > [https://github.com/apache/arrow/issues/7338|https://github.com/apache/arrow/issues/7338.] 
> Related classes/methods in (pyarrow 9.0.0) > _ParquetDatasetV2.read > self._dataset.to_table(columns=columns, filter=self._filter_expression, > use_threads=use_threads) > pyarrow._dataset.FileSystemDatase.to_table > I played with pyarrow._dataset.Scanner.to_table > self._dataset.scanner(columns=columns, > filter=self._filter_expression).to_table() > The memory usage is small to construct the scanner but then goes up after the > to_table call materializes it. > Is there some way or workaround to reduce the memory usage with filters? > If not supported yet, can it be fixed/improved with priority? > This is a blocking issue for us. I don't see if it can be patched in the > Python code. > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17590) Lower memory usage with filters
Yin created ARROW-17590:
---
Summary: Lower memory usage with filters
Key: ARROW-17590
URL: https://issues.apache.org/jira/browse/ARROW-17590
Project: Apache Arrow
Issue Type: Improvement
Reporter: Yin

Hi,
When I read a large parquet file with a filter for a small number of rows, the memory usage is pretty high. The result table and dataframe have only a few rows; it looks like many rows are scanned/loaded from the parquet file. Not only is the footprint (watermark) of memory usage high, but the memory also does not seem to be released promptly (for example after a GC in Python; it may be reused by a subsequent read). The filtered column is not a partition key; filtering on it works functionally and returns a small number of rows. But the memory usage is high whenever the Parquet file is large, partitioned or not. There are earlier reports related to this issue, for example: [https://github.com/apache/arrow/issues/7338]
Related classes/methods (pyarrow 9.0.0):
_ParquetDatasetV2.read
self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)
pyarrow._dataset.FileSystemDataset.to_table
I also tried pyarrow._dataset.Scanner.to_table:
self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
Constructing the scanner uses little memory, but usage climbs once the to_table call materializes it.
Is there some way or workaround to reduce the memory usage with filters? If not supported yet, can it be fixed/improved with priority? This is a blocking issue for us. I cannot tell whether it can be patched in the Python code. Thanks!
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags
[ https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598685#comment-17598685 ] Kouhei Sutou commented on ARROW-17580: -- {quote} For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). {quote} Really? {{CXXFLAGS}} environment variable should be respected because we don't override {{CMAKE_CXX_FLAGS}} that is initialized with {{CXXFLAGS}} environment variable. See also: * https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L555 * https://cmake.org/cmake/help/latest/envvar/CXXFLAGS.html For PyArrow: * We should unify {{python/CMakeLists.txt}} and {{python/pyarrow/src/CMakeLists.txt}}. * We need to look into why {{CXXFLAGS}} environment variable is ignored. > [Doc][C++][Python] Unclear how to influence compilation flags > - > > Key: ARROW-17580 > URL: https://issues.apache.org/jira/browse/ARROW-17580 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Antoine Pitrou >Priority: Critical > > Frequently people need to customize compilation flags for C++ and/or C files. > Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find > out the proper way to do this. > For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while > the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). > For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment > variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two > problems: > * it is only recognized for Cython-generated files, not for PyArrow C++ > sources > * it only affects linker calls, while it should actually affect compiler calls -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17578) [CI] Nightly test-r-gcc-12 fails to build
[ https://issues.apache.org/jira/browse/ARROW-17578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598676#comment-17598676 ] Kouhei Sutou commented on ARROW-17578: -- We need to focus on the following message: https://github.com/ursacomputing/crossbow/runs/8104457062?check_suite_focus=true#step:5:1897 {noformat} #4 9.317 The following packages have unmet dependencies: #4 9.411 python3-dev : Depends: python3-distutils (>= 3.10.6-1~) but 3.10.4-0ubuntu1 is to be installed #4 9.415 E: Unable to correct problems, you have held broken packages. {noformat} This is caused by {{ppa:ubuntu-toolchain-r/volatile}} https://github.com/apache/arrow/blob/master/ci/docker/ubuntu-22.04-cpp.dockerfile#L118 . It provides {{python3-dev=3.10.6-1~22.04}} but doesn't provide {{python3-distutils=3.10.6-1~22.04}}. It seems that Ubuntu 22.04 provides {{gcc-12}}: https://packages.ubuntu.com/search?keywords=gcc-12 How about using the system {{gcc-12}} instead of {{gcc-12}} provided by {{ppa:ubuntu-toolchain-r/volatile}}? 
{noformat} diff --git a/ci/docker/ubuntu-22.04-cpp.dockerfile b/ci/docker/ubuntu-22.04-cpp.dockerfile index 05aca53151..514e314c40 100644 --- a/ci/docker/ubuntu-22.04-cpp.dockerfile +++ b/ci/docker/ubuntu-22.04-cpp.dockerfile @@ -112,7 +112,7 @@ RUN if [ "${gcc_version}" = "" ]; then \ g++ \ gcc; \ else \ - if [ "${gcc_version}" -gt "11" ]; then \ + if [ "${gcc_version}" -gt "12" ]; then \ apt-get update -y -q && \ apt-get install -y -q --no-install-recommends software-properties-common && \ add-apt-repository ppa:ubuntu-toolchain-r/volatile; \ {noformat} > [CI] Nightly test-r-gcc-12 fails to build > - > > Key: ARROW-17578 > URL: https://issues.apache.org/jira/browse/ARROW-17578 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Raúl Cumplido >Priority: Major > Labels: Nightly > > [test-r-gcc-12|https://github.com/ursacomputing/crossbow/runs/8104457062?check_suite_focus=true] > has been failing to build since the 18th of August. The current error log is: > {code:java} > #4 ERROR: executor failed running [/bin/bash -o pipefail -c apt-get update > -y && apt-get install -y dirmngr apt-transport-https > software-properties-common && wget -qO- > https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | > tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && add-apt-repository > 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release > -cs)'-cran40/' && apt-get install -y r-base=${r}* > r-recommended=${r}* libxml2-dev libgit2-dev > libssl-dev clang clang-format clang-tidy > texlive-latex-base locales python3 python3-pip > python3-dev && locale-gen en_US.UTF-8 && apt-get clean && rm -rf > /var/lib/apt/lists/*]: exit code: 100 > -- > > [ 2/17] RUN apt-get update -y && apt-get install -y dirmngr > apt-transport-https software-properties-common && wget -qO- > https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | > tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && add-apt-repository > 'deb 
https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release > -cs)'-cran40/' && apt-get install -y r-base=4.2* > r-recommended=4.2* libxml2-dev libgit2-dev libssl-dev > clang clang-format clang-tidy > texlive-latex-base locales python3 python3-pip > python3-dev && locale-gen en_US.UTF-8 && apt-get clean && rm -rf > /var/lib/apt/lists/*: > -- > executor failed running [/bin/bash -o pipefail -c apt-get update -y && > apt-get install -y dirmngr apt-transport-https > software-properties-common && wget -qO- > https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | > tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && add-apt-repository > 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release > -cs)'-cran40/' && apt-get install -y r-base=${r}* > r-recommended=${r}* libxml2-dev libgit2-dev > libssl-dev clang clang-format clang-tidy > texlive-latex-base locales python3 python3-pip > python3-dev && locale-gen en_US.UTF-8 && apt-get clean && rm -rf > /var/lib/apt/lists/*]: exit code: 100 > Service 'ubuntu-r-only-r' failed to build : Build failed {code} > I can't reproduce locally (I get a different error) but could it be that w
[jira] [Resolved] (ARROW-17079) [C++] Improve error message propagation from AWS SDK
[ https://issues.apache.org/jira/browse/ARROW-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Moritz resolved ARROW-17079. Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14001 [https://github.com/apache/arrow/pull/14001] > [C++] Improve error message propagation from AWS SDK > > > Key: ARROW-17079 > URL: https://issues.apache.org/jira/browse/ARROW-17079 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 8.0.0 >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Minor > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > Dear all, > I'd like to see if there is interest to improve the error messages that > originate from the AWS SDK. Especially for loading datasets from S3, there > are many things that can go wrong and the error messages that (Py)Arrow gives > are not always the most actionable, especially if the call involves many > different SDK functions. In particular, it would be great to have the > following attached to each error message: > * A machine parseable status code from the AWS SDK > * Information as to exactly which AWS SDK call failed, so it can be > disambiguated for Arrow API calls that use multiple AWS SDK calls > In the ideal case, as a developer I could reconstruct the AWS SDK call that > failed from the error message (e.g. in a form the allows me to run the API > call via the "aws" CLI program) so I can debug errors and see how they relate > to my AWS infrastructure. Any progress in this direction would be super > helpful. > > For context: I recently was debugging some permissioning issues in S3 based > on the current error codes and it was pretty hard to figure out what was > going on (see > [https://github.com/ray-project/ray/issues/19799#issuecomment-1185035602).] > > I'm happy to take a stab at this problem but might need some help. 
Is > implementing a custom StatusDetail class for AWS errors and propagating > errors that way the right hunch here? > [https://github.com/apache/arrow/blob/50f6fcad6cc09c06e78dcd09ad07218b86e689de/cpp/src/arrow/status.h#L110] > > All the best, > Philipp. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files
[ https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14161: --- Labels: pull-request-available (was: ) > [C++][Parquet][Docs] Reading/Writing Parquet Files > -- > > Key: ARROW-14161 > URL: https://issues.apache.org/jira/browse/ARROW-14161 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Rares Vernica >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Missing documentation on Reading/Writing Parquet files C++ api: > * > [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE] > missing docs on chunk_size found some > [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53] > _size of the RowGroup in the parquet file. Normally you would choose this to > be rather large_ > * Typo in file reader > [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the > include should be {{#include "parquet/arrow/reader.h"}} > * > [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE] > missing docs on {{compression}} > * Missing example on using WriteProperties -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17589) [Docs] Clarify that C data interface is not strictly for within a single process
[ https://issues.apache.org/jira/browse/ARROW-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17589: --- Labels: pull-request-available (was: ) > [Docs] Clarify that C data interface is not strictly for within a single > process > > > Key: ARROW-17589 > URL: https://issues.apache.org/jira/browse/ARROW-17589 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Affects Versions: 9.0.0 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The Arrow C data interface and C stream interface docs say that these > interfaces are only for sharing data within a single process. This is not > exactly correct. The fundamental requirement for using the C data interface > and C stream interface is not that the components be necessarily running in > the same process. It’s that they are able to pass around a pointer and access > the same memory address space. Let's make some minor changes to the docs to > clarify this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17589) [Docs] Clarify that C data interface is not strictly for within a single process
Ian Cook created ARROW-17589: Summary: [Docs] Clarify that C data interface is not strictly for within a single process Key: ARROW-17589 URL: https://issues.apache.org/jira/browse/ARROW-17589 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 9.0.0 Reporter: Ian Cook Assignee: Ian Cook Fix For: 10.0.0 The Arrow C data interface and C stream interface docs say that these interfaces are only for sharing data within a single process. This is not exactly correct. The fundamental requirement for using the C data interface and C stream interface is not that the components be necessarily running in the same process. It’s that they are able to pass around a pointer and access the same memory address space. Let's make some minor changes to the docs to clarify this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17577) [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow
[ https://issues.apache.org/jira/browse/ARROW-17577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598629#comment-17598629 ] Kouhei Sutou commented on ARROW-17577: -- Sorry for the inconvenience. I think that you have the {{PKG_CONFIG_PATH=$ARROW_HOME/lib/pkgconfig}} environment variable set. You need to set the {{CMAKE_PREFIX_PATH=$ARROW_HOME}} environment variable instead to find {{Arrow}} and other CMake packages. I'll update the related documents in ARROW-17575. > [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow > > > Key: ARROW-17577 > URL: https://issues.apache.org/jira/browse/ARROW-17577 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Alenka Frim >Priority: Major > > When building on master yesterday the PyArrow build worked fine. Today there > is an issue with CMake unable to find packages. See: > > {code:java} > -- Finished CMake build and install for PyArrow C++ > creating /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9 > -- Running cmake for PyArrow > cmake -DPYTHON_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python > -DPython3_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python > -DPYARROW_CPP_HOME=/Users/alenkafrim/repos/arrow/python/build/dist "" > -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_SUBSTRAIT=off > -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on > -DPYARROW_BUILD_PARQUET_ENCRYPTION=off -DPYARROW_BUILD_PLASMA=off > -DPYARROW_BUILD_GCS=off -DPYARROW_BUILD_S3=on -DPYARROW_BUILD_HDFS=off > -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off > -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off > -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on > -DCMAKE_BUILD_TYPE=release /Users/alenkafrim/repos/arrow/python > CMake Warning: > Ignoring empty string ("") provided on the command line. 
> -- The C compiler identification is AppleClang 13.1.6.13160021 > -- The CXX compiler identification is AppleClang 13.1.6.13160021 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: > /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc > - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: > /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ > - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- System processor: arm64 > -- Performing Test CXX_SUPPORTS_ARMV8_ARCH > -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Success > -- Arrow build warning level: PRODUCTION > -- Configured for RELEASE build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: RELEASE > -- Generator: Unix Makefiles > -- Build output directory: > /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9/release > -- Found Python3: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python (found > version "3.9.13") found components: Interpreter Development.Module NumPy > -- Found Python3Alt: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python > CMake Error at > /opt/homebrew/Cellar/cmake/3.24.1/share/cmake/Modules/CMakeFindDependencyMacro.cmake:47 > (find_package): > By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has > asked CMake to find a package configuration file provided by "Arrow", but > CMake did not find one. > Could not find a package configuration file provided by "Arrow" with any of > the following names: > ArrowConfig.cmake > arrow-config.cmake > Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set > "Arrow_DIR" to a directory containing one of the above files. 
If "Arrow" > provides a separate development package or SDK, be sure it has been > installed. > Call Stack (most recent call first): > build/dist/lib/cmake/ArrowPython/ArrowPythonConfig.cmake:54 > (find_dependency) > CMakeLists.txt:240 (find_package) > {code} > I did a clean built on the latest master. Am I missing some variables that > need to be set after [https://github.com/apache/arrow/pull/13892] ? > I am calling cmake with these flags: > > {code:java} > cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ > -DCMAKE_INSTALL_LIBDIR=lib \ > -DCMAKE_BUILD_TYPE=debug \ > -DARROW_WITH_BZ2=ON \ > -DARROW_WITH_ZLIB=ON \ > -DARROW_WITH_ZSTD=ON \ > -DARROW_WITH_LZ4=ON \ > -DARROW_WITH_SNAPPY=ON \ > -DARROW_WITH_B
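Kouhei's fix above boils down to one environment variable. A minimal sketch, assuming {{$ARROW_HOME}} is the prefix Arrow C++ was installed into:

```shell
# CMake config-mode packages (ArrowConfig.cmake) are located via
# CMAKE_PREFIX_PATH, not PKG_CONFIG_PATH, so point it at the install prefix
# before invoking the PyArrow build.
export CMAKE_PREFIX_PATH=$ARROW_HOME
```

With this set, CMake's find_package(Arrow) searches $ARROW_HOME/lib/cmake for ArrowConfig.cmake instead of relying on the pkg-config files.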
[jira] [Updated] (ARROW-17575) [C++][Docs] Update build document to follow new CMake package
[ https://issues.apache.org/jira/browse/ARROW-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-17575: - Description: This is a follow-up of ARROW-12175. https://github.com/apache/arrow/blob/master/docs/source/cpp/build_system.rst should be updated. was:This is a follow-up of ARROW-12175. > [C++][Docs] Update build document to follow new CMake package > - > > Key: ARROW-17575 > URL: https://issues.apache.org/jira/browse/ARROW-17575 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > > This is a follow-up of ARROW-12175. > https://github.com/apache/arrow/blob/master/docs/source/cpp/build_system.rst > should be updated. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598581#comment-17598581 ] Arthur Passos commented on ARROW-17459: --- I am a bit lost right now. I have made some changes to use LargeBinaryBuilder, but there is always an inconsistency that throws an exception. Are you aware of any place in the code where instead of taking the String path it would take the LargeString path? I went all the way back to where it reads the schema in the hope of finding a place I could change the DataType from STRING to LARGE_STRING. Couldn't do so. > [C++] Support nested data conversions for chunked array > --- > > Key: ARROW-17459 > URL: https://issues.apache.org/jira/browse/ARROW-17459 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Arthur Passos >Priority: Blocker > > `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not > implemented for chunked array outputs". It fails on > [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]) > Data schema is: > {code:java} > optional group fields_map (MAP) = 217 { > repeated group key_value { > required binary key (STRING) = 218; > optional binary value (STRING) = 219; > } > } > fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047 > fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963 > {code} > Is there a way to work around this issue in the cpp lib? > In any case, I am willing to implement this, but I need some guidance. I am > very new to parquet (as in started reading about it yesterday). > > Probably related to: https://issues.apache.org/jira/browse/ARROW-10958 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17588) [Go] Casting to BinaryLike types
Matthew Topol created ARROW-17588: - Summary: [Go] Casting to BinaryLike types Key: ARROW-17588 URL: https://issues.apache.org/jira/browse/ARROW-17588 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17587) [Go] Cast from Extension Types
[ https://issues.apache.org/jira/browse/ARROW-17587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17587: --- Labels: pull-request-available (was: ) > [Go] Cast from Extension Types > -- > > Key: ARROW-17587 > URL: https://issues.apache.org/jira/browse/ARROW-17587 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17587) [Go] Cast from Extension Types
Matthew Topol created ARROW-17587: - Summary: [Go] Cast from Extension Types Key: ARROW-17587 URL: https://issues.apache.org/jira/browse/ARROW-17587 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17584) [Go] Unable to build with tinygo
[ https://issues.apache.org/jira/browse/ARROW-17584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598533#comment-17598533 ] Matthew Topol commented on ARROW-17584: --- [~tschaub] So far we've been keeping the N-2 concept for support. Since `unsafe.Slice` was introduced in go1.17, it would make sense for us to bump our required version to go1.17 and convert our references to `reflect.SliceHeader` to use `unsafe.Slice` instead. Can you check / confirm whether or not `unsafe.Slice` works correctly with TinyGo such that it would be a viable alternative? > [Go] Unable to build with tinygo > > > Key: ARROW-17584 > URL: https://issues.apache.org/jira/browse/ARROW-17584 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Tim Schaub >Priority: Major > > I was hoping to use TinyGo to build WASM binaries with Arrow. TinyGo can > generate builds that are [1% the > size|https://tinygo.org/getting-started/overview/#:~:text=The%20only%20difference%20here%2C%20is,used%2C%20and%20the%20associated%20runtime.&text=In%20this%20case%20the%20Go,size%20(251k%20before%20stripping)!] > of those generated with Go (significant for applications hosted on the web). > Arrow's use of `reflect.SliceHeader` fields limits the portability of the > code. For example, the `Len` and `Cap` fields are assumed to be `int` here: > https://github.com/apache/arrow/blob/go/v9.0.0/go/arrow/bitutil/bitutil.go#L158-L159 > Go's [reflect package > warns|https://github.com/golang/go/blob/go1.19/src/reflect/value.go#L2675-L2685] > that the SliceHeader "cannot be used safely or portably and its > representation may change in a later release." 
> Attempts to build a WASM binary using the github.com/apache/arrow/go/v10 > module result in failures like this: > {code} > tinygo build -tags noasm -o test.wasm ./main.go > {code} > {code} > # github.com/apache/arrow/go/v10/arrow/bitutil > ../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:158:10: > invalid operation: h.Len / uint64SizeBytes (mismatched types uintptr and int) > ../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:159:10: > invalid operation: h.Cap / uint64SizeBytes (mismatched types uintptr and int) > {code} > This happens because TinyGo uses `uintptr` for the corresponding types: > https://github.com/tinygo-org/tinygo/blob/v0.25.0/src/reflect/value.go#L773-L777 > This feels like an issue with TinyGo, and it has been ticketed there multiple > times (see https://github.com/tinygo-org/tinygo/issues/1284). They lean on > the warnings in the Go sources that use of the SliceHeader fields makes code > unportable and suggest changes to the libraries that do not heed this warning. > I don't have a suggested fix or alternative for Arrow's use of SliceHeader > fields, but I'm wondering if there would be willingness on the part of this > package to make WASM builds work with TinyGo. Perhaps the TinyGo authors > could also offer suggested changes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
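The `unsafe.Slice` approach suggested above can be sketched as follows, with a hypothetical `bytesToUint64` helper standing in for the real bitutil function (which may differ). `unsafe.Slice` (go1.17+) takes a typed element pointer and a length, so no `reflect.SliceHeader` `Len`/`Cap` fields are touched and the `int` vs `uintptr` mismatch TinyGo hits disappears:

```go
package main

import (
	"fmt"
	"unsafe"
)

// uint64SizeBytes mirrors the constant used in arrow's bitutil package.
const uint64SizeBytes = 8

// bytesToUint64 reinterprets b as a []uint64 without copying. This is a
// sketch of the unsafe.Slice alternative to writing through
// reflect.SliceHeader fields; the real helper in bitutil may differ.
func bytesToUint64(b []byte) []uint64 {
	if len(b) < uint64SizeBytes {
		return nil
	}
	// unsafe.Slice builds a slice from an element pointer and a length,
	// portably, without assuming the layout of the slice header.
	return unsafe.Slice((*uint64)(unsafe.Pointer(&b[0])), len(b)/uint64SizeBytes)
}

func main() {
	b := make([]byte, 32)
	u := bytesToUint64(b)
	fmt.Println(len(u)) // 32 bytes reinterpreted as 4 uint64 words
}
```

Whether TinyGo's implementation of `unsafe.Slice` behaves identically is exactly the open question in the comment above.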
[jira] [Updated] (ARROW-17586) [Go] String to Numeric Cast functions
[ https://issues.apache.org/jira/browse/ARROW-17586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17586: --- Labels: pull-request-available (was: ) > [Go] String to Numeric Cast functions > - > > Key: ARROW-17586 > URL: https://issues.apache.org/jira/browse/ARROW-17586 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go >Reporter: Matthew Topol >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak
[ https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-17573. --- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 14013 [https://github.com/apache/arrow/pull/14013] > [Go] Parquet ByteArray statistics cause memory leak > --- > > Key: ARROW-17573 > URL: https://issues.apache.org/jira/browse/ARROW-17573 > Project: Apache Arrow > Issue Type: Bug > Components: Go, Parquet >Affects Versions: 9.0.0 >Reporter: Sasha Sirovica >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > When using `arrow.BinaryTypes.String` in a schema, appending multiple > strings, and then writing a record out to parquet the memory of the program > continuously increases. This also applies for the other `arrow.BinaryTypes` > > I took a heap dump midway through the program and the majority of allocations > comes from `StringBuilder.Append` which is not GC'd. I approached 16GB of RAM > before terminating the program. > > I was not able to replicate this behavior with just PrimativeTypes. Another > interesting point, if the records are created but never written with pqarrow > memory does not grow. In the below program commenting out `w.Write(rec)` will > not cause memory issues. 
> Example program which causes memory to leak: > {code:java} > package main > import ( >"os" >"github.com/apache/arrow/go/v9/arrow" >"github.com/apache/arrow/go/v9/arrow/array" >"github.com/apache/arrow/go/v9/arrow/memory" >"github.com/apache/arrow/go/v9/parquet" >"github.com/apache/arrow/go/v9/parquet/compress" >"github.com/apache/arrow/go/v9/parquet/pqarrow" > ) > func main() { >f, _ := os.Create("/tmp/test.parquet") >arrowProps := pqarrow.DefaultWriterProps() >schema := arrow.NewSchema( > []arrow.Field{ > {Name: "aString", Type: arrow.BinaryTypes.String}, > }, > nil, >) >w, _ := pqarrow.NewFileWriter(schema, f, > parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), > arrowProps) >builder := array.NewRecordBuilder(memory.DefaultAllocator, schema) >for i := 1; i < 50; i++ { > builder.Field(0).(*array.StringBuilder).Append("HelloWorld!") > if i%200 == 0 { > // Write row groups out every 2M times > rec := builder.NewRecord() > w.Write(rec) > rec.Release() > } >} >w.Close() > }{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
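The leak pattern in the report above is easiest to see with a toy reference count. The sketch below is hypothetical, not Arrow's actual buffer type; it only illustrates how one missing `Release` (here standing in for the one the statistics path apparently dropped) pins memory for the life of the program:

```go
package main

import "fmt"

// refBuffer sketches the Retain/Release contract that arrow's Go buffers
// follow: every Retain must be paired with a Release, or the memory is
// never handed back to the allocator.
type refBuffer struct {
	refs int
	data []byte
}

func (b *refBuffer) Retain() { b.refs++ }

func (b *refBuffer) Release() {
	b.refs--
	if b.refs == 0 {
		b.data = nil // returned to the allocator in the real implementation
	}
}

func main() {
	buf := &refBuffer{refs: 1, data: make([]byte, 1024)}
	buf.Retain()  // e.g. column statistics keep a reference to the values
	buf.Release() // the writer is done with the buffer
	fmt.Println(buf.refs, buf.data != nil) // still pinned by the extra ref
	buf.Release() // the pairing release; without it the data leaks
	fmt.Println(buf.refs, buf.data != nil)
}
```

With a leak like this, heap dumps point at the allocation site (`StringBuilder.Append` in the report) rather than at whoever forgot the release, which is why these are awkward to track down.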
[jira] [Created] (ARROW-17586) [Go] String to Numeric Cast functions
Matthew Topol created ARROW-17586: - Summary: [Go] String to Numeric Cast functions Key: ARROW-17586 URL: https://issues.apache.org/jira/browse/ARROW-17586 Project: Apache Arrow Issue Type: Sub-task Components: Go Reporter: Matthew Topol Assignee: Matthew Topol -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14161) [C++][Parquet][Docs] Reading/Writing Parquet Files
[ https://issues.apache.org/jira/browse/ARROW-14161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-14161: -- Assignee: Will Jones > [C++][Parquet][Docs] Reading/Writing Parquet Files > -- > > Key: ARROW-14161 > URL: https://issues.apache.org/jira/browse/ARROW-14161 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: Rares Vernica >Assignee: Will Jones >Priority: Minor > Fix For: 10.0.0 > > > Missing documentation on Reading/Writing Parquet files C++ api: > * > [WriteTable|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10WriteTableERKN5arrow5TableEP10MemoryPoolNSt10shared_ptrIN5arrow2io12OutputStreamEEE7int64_tNSt10shared_ptrI16WriterPropertiesEENSt10shared_ptrI21ArrowWriterPropertiesEE] > missing docs on chunk_size found some > [here|https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/examples/parquet-arrow/src/reader-writer.cc#L53] > _size of the RowGroup in the parquet file. Normally you would choose this to > be rather large_ > * Typo in file reader > [example|https://arrow.apache.org/docs/cpp/parquet.html#filereader] the > include should be {{#include "parquet/arrow/reader.h"}} > * > [WriteProperties/Builder|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE] > missing docs on {{compression}} > * Missing example on using WriteProperties -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17585) [Java] Extend types supported by GenerateSampleData
Larry White created ARROW-17585: --- Summary: [Java] Extend types supported by GenerateSampleData Key: ARROW-17585 URL: https://issues.apache.org/jira/browse/ARROW-17585 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Larry White org.apache.arrow.vector.GenerateSampleData does not support the UInt vector types. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-14958) [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight
[ https://issues.apache.org/jira/browse/ARROW-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598502#comment-17598502 ] Todd Farmer commented on ARROW-14958: - This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon. > [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight > --- > > Key: ARROW-14958 > URL: https://issues.apache.org/jira/browse/ARROW-14958 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > Sans Python support, at least for now, since figuring out how to do the > bindings will be a challenge there. Also see > [https://github.com/open-telemetry/community/discussions/734] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-14958) [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight
[ https://issues.apache.org/jira/browse/ARROW-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Farmer reassigned ARROW-14958: --- Assignee: (was: David Li) > [C++][FlightRPC] Enable OpenTelemetry with Arrow Flight > --- > > Key: ARROW-14958 > URL: https://issues.apache.org/jira/browse/ARROW-14958 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > Sans Python support, at least for now, since figuring out how to do the > bindings will be a challenge there. Also see > [https://github.com/open-telemetry/community/discussions/734] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598501#comment-17598501 ] Micah Kornfield commented on ARROW-17459: - Yes, I think there are some code changes, we hard-code non large [BinaryBuilder|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1693] for accumulating chunks and then used when [decoding arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1392]. To answer your questions, I don't think the second case applies. As far as I know Parquet C++ does its own chunking and doesn't try to read back the exact chunking that the values are written with. > [C++] Support nested data conversions for chunked array > --- > > Key: ARROW-17459 > URL: https://issues.apache.org/jira/browse/ARROW-17459 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Arthur Passos >Priority: Blocker > > `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not > implemented for chunked array outputs". It fails on > [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]) > Data schema is: > {code:java} > optional group fields_map (MAP) = 217 { > repeated group key_value { > required binary key (STRING) = 218; > optional binary value (STRING) = 219; > } > } > fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047 > fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963 > {code} > Is there a way to work around this issue in the cpp lib? > In any case, I am willing to implement this, but I need some guidance. I am > very new to parquet (as in started reading about it yesterday). > > Probably related to: https://issues.apache.org/jira/browse/ARROW-10958 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-6485) [Format][C++]Support the format of a COO sparse matrix that has separated row and column indices
[ https://issues.apache.org/jira/browse/ARROW-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-6485: - Assignee: Rok Mihevc > [Format][C++]Support the format of a COO sparse matrix that has separated row > and column indices > > > Key: ARROW-6485 > URL: https://issues.apache.org/jira/browse/ARROW-6485 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Kenta Murata >Assignee: Rok Mihevc >Priority: Major > Labels: pull-request-available > Time Spent: 2h 40m > Remaining Estimate: 0h > > For supporting non-copy interchanging of scipy.sparse.coo_matrix, I'd like to > add the new format of a COO matrix that has separated row and column indices. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15787) [C++] Temporal floor/ceil/round kernels could be optimised with templating
[ https://issues.apache.org/jira/browse/ARROW-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-15787: -- Assignee: Rok Mihevc > [C++] Temporal floor/ceil/round kernels could be optimised with templating > -- > > Key: ARROW-15787 > URL: https://issues.apache.org/jira/browse/ARROW-15787 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Minor > Labels: kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > [CeilTemporal, FloorTemporal, > RoundTemporal|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L728-L980] > kernels could probably be templated in a clean way. They also execute a > switch statement for every call instead of creating an operator at kernel > call time and only running that. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile
[ https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc resolved ARROW-16147. Resolution: Fixed > [C++] ParquetFileWriter doesn't call sink_.Close when using > GcsRandomAccessFile > --- > > Key: ARROW-16147 > URL: https://issues.apache.org/jira/browse/ARROW-16147 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Assignee: Micah Kornfield >Priority: Major > Labels: GCP > > On parquet::arrow::FileWriter::Close the underlying sink is not closed. The > implementation goes to FileSerializer::Close: > {code:cpp} > void Close() override { > if (is_open_) { > // If any functions here raise an exception, we set is_open_ to be false > // so that this does not get called again (possibly causing segfault) > is_open_ = false; > if (row_group_writer_) { > num_rows_ += row_group_writer_->num_rows(); > row_group_writer_->Close(); > } > row_group_writer_.reset(); > // Write magic bytes and metadata > auto file_encryption_properties = > properties_->file_encryption_properties(); > if (file_encryption_properties == nullptr) { // Non encrypted file. > file_metadata_ = metadata_->Finish(); > WriteFileMetaData(*file_metadata_, sink_.get()); > } else { // Encrypted file > CloseEncryptedFile(file_encryption_properties); > } > } > } > {code} > It doesn't call sink_->Close(), which leads to resource leaking and bugs. > With files (they have own close() in destructor) it works fine, but doesn't > work with fs::GcsRandomAccessFile. When I calling > parquet::arrow::FileWriter::Close the data is not flushed to storage, until > manual close of a sink stream (or stack space change). > Is it done by intention or a bug? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile
[ https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598492#comment-17598492 ] Rok Mihevc commented on ARROW-16147: I think this was resolved by the linked Micah's PR. > [C++] ParquetFileWriter doesn't call sink_.Close when using > GcsRandomAccessFile > --- > > Key: ARROW-16147 > URL: https://issues.apache.org/jira/browse/ARROW-16147 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: GCP > > On parquet::arrow::FileWriter::Close the underlying sink is not closed. The > implementation goes to FileSerializer::Close: > {code:cpp} > void Close() override { > if (is_open_) { > // If any functions here raise an exception, we set is_open_ to be false > // so that this does not get called again (possibly causing segfault) > is_open_ = false; > if (row_group_writer_) { > num_rows_ += row_group_writer_->num_rows(); > row_group_writer_->Close(); > } > row_group_writer_.reset(); > // Write magic bytes and metadata > auto file_encryption_properties = > properties_->file_encryption_properties(); > if (file_encryption_properties == nullptr) { // Non encrypted file. > file_metadata_ = metadata_->Finish(); > WriteFileMetaData(*file_metadata_, sink_.get()); > } else { // Encrypted file > CloseEncryptedFile(file_encryption_properties); > } > } > } > {code} > It doesn't call sink_->Close(), which leads to resource leaking and bugs. > With files (they have own close() in destructor) it works fine, but doesn't > work with fs::GcsRandomAccessFile. When I calling > parquet::arrow::FileWriter::Close the data is not flushed to storage, until > manual close of a sink stream (or stack space change). > Is it done by intention or a bug? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile
[ https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-16147: -- Assignee: Micah Kornfield > [C++] ParquetFileWriter doesn't call sink_.Close when using > GcsRandomAccessFile > --- > > Key: ARROW-16147 > URL: https://issues.apache.org/jira/browse/ARROW-16147 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Assignee: Micah Kornfield >Priority: Major > Labels: GCP > > On parquet::arrow::FileWriter::Close the underlying sink is not closed. The > implementation goes to FileSerializer::Close: > {code:cpp} > void Close() override { > if (is_open_) { > // If any functions here raise an exception, we set is_open_ to be false > // so that this does not get called again (possibly causing segfault) > is_open_ = false; > if (row_group_writer_) { > num_rows_ += row_group_writer_->num_rows(); > row_group_writer_->Close(); > } > row_group_writer_.reset(); > // Write magic bytes and metadata > auto file_encryption_properties = > properties_->file_encryption_properties(); > if (file_encryption_properties == nullptr) { // Non encrypted file. > file_metadata_ = metadata_->Finish(); > WriteFileMetaData(*file_metadata_, sink_.get()); > } else { // Encrypted file > CloseEncryptedFile(file_encryption_properties); > } > } > } > {code} > It doesn't call sink_->Close(), which leads to resource leaking and bugs. > With files (they have own close() in destructor) it works fine, but doesn't > work with fs::GcsRandomAccessFile. When I calling > parquet::arrow::FileWriter::Close the data is not flushed to storage, until > manual close of a sink stream (or stack space change). > Is it done by intention or a bug? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15011) [R] Generate documentation for dplyr function bindings
[ https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15011: --- Labels: pull-request-available (was: ) > [R] Generate documentation for dplyr function bindings > -- > > Key: ARROW-15011 > URL: https://issues.apache.org/jira/browse/ARROW-15011 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We don't want to (re)write the documentation for each binding that exists, > but could we use templates or other automated ways of documenting "This > binding should work just like X from package Y" when that's true, and then > have a place to put some of the exceptions? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15011) [R] Generate documentation for dplyr function bindings
[ https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-15011: Summary: [R] Generate documentation for dplyr function bindings (was: [R] Can we (semi?) automatically document when a binding exists) > [R] Generate documentation for dplyr function bindings > -- > > Key: ARROW-15011 > URL: https://issues.apache.org/jira/browse/ARROW-15011 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > We don't want to (re)write the documentation for each binding that exists, > but could we use templates or other automated ways of documenting "This > binding should work just like X from package Y" when that's true, and then > have a place to put some of the exceptions? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15011) [R] Generate documentation for dplyr function bindings
[ https://issues.apache.org/jira/browse/ARROW-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-15011: --- Assignee: Neal Richardson (was: Dragoș Moldovan-Grünfeld) > [R] Generate documentation for dplyr function bindings > -- > > Key: ARROW-15011 > URL: https://issues.apache.org/jira/browse/ARROW-15011 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > > We don't want to (re)write the documentation for each binding that exists, > but could we use templates or other automated ways of documenting "This > binding should work just like X from package Y" when that's true, and then > have a place to put some of the exceptions? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17584) [Go] Unable to build with tinygo
Tim Schaub created ARROW-17584: -- Summary: [Go] Unable to build with tinygo Key: ARROW-17584 URL: https://issues.apache.org/jira/browse/ARROW-17584 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Tim Schaub I was hoping to use TinyGo to build WASM binaries with Arrow. TinyGo can generate builds that are [1% the size|https://tinygo.org/getting-started/overview/#:~:text=The%20only%20difference%20here%2C%20is,used%2C%20and%20the%20associated%20runtime.&text=In%20this%20case%20the%20Go,size%20(251k%20before%20stripping)!] of those generated with Go (significant for applications hosted on the web). Arrow's use of `reflect.SliceHeader` fields limits the portability of the code. For example, the `Len` and `Cap` fields are assumed to be `int` here: https://github.com/apache/arrow/blob/go/v9.0.0/go/arrow/bitutil/bitutil.go#L158-L159 Go's [reflect package warns|https://github.com/golang/go/blob/go1.19/src/reflect/value.go#L2675-L2685] that the SliceHeader "cannot be used safely or portably and its representation may change in a later release." Attempts to build a WASM binary using the github.com/apache/arrow/go/v10 module result in failures like this: {code} tinygo build -tags noasm -o test.wasm ./main.go {code} {code} # github.com/apache/arrow/go/v10/arrow/bitutil ../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:158:10: invalid operation: h.Len / uint64SizeBytes (mismatched types uintptr and int) ../../go/pkg/mod/github.com/apache/arrow/go/v10@v10.0.0-20220831082949-cf27001da088/arrow/bitutil/bitutil.go:159:10: invalid operation: h.Cap / uint64SizeBytes (mismatched types uintptr and int) {code} This happens because TinyGo uses `uintptr` for the corresponding types: https://github.com/tinygo-org/tinygo/blob/v0.25.0/src/reflect/value.go#L773-L777 This feels like an issue with TinyGo, and it has been ticketed there multiple times (see https://github.com/tinygo-org/tinygo/issues/1284). 
They lean on the warnings in the Go sources that use of the SliceHeader fields makes code unportable and suggest changes to the libraries that do not heed this warning. I don't have a suggested fix or alternative for Arrow's use of SliceHeader fields, but I'm wondering if there would be willingness on the part of this package to make WASM builds work with TinyGo. Perhaps the TinyGo authors could also offer suggested changes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
[ https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598477#comment-17598477 ] Joost Hoozemans commented on ARROW-17583: - Thanks for the quick response! This should be a small change, I think I can submit something tomorrow. > [Python] File write visitor throws exception on large parquet file > -- > > Key: ARROW-17583 > URL: https://issues.apache.org/jira/browse/ARROW-17583 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Joost Hoozemans >Assignee: Joost Hoozemans >Priority: Major > > When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws > an exception: > Traceback (most recent call last): > File "pyarrow/_dataset_parquet.pyx", line 165, in > pyarrow._dataset_parquet.ParquetFileFormat._finish_write > File "pyarrow/{_}dataset.pyx", line 2695, in > pyarrow._dataset.WrittenFile.{_}{_}init{_}_ > OverflowError: value too large to convert to int > Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' > The file is written succesfully though. It seems related to this issue > https://issues.apache.org/jira/browse/ARROW-16761. > I would guess the problem is the python field is an int while the C++ code > returns an int64_t > [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
[ https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joost Hoozemans reassigned ARROW-17583: --- Assignee: Joost Hoozemans > [Python] File write visitor throws exception on large parquet file > -- > > Key: ARROW-17583 > URL: https://issues.apache.org/jira/browse/ARROW-17583 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Joost Hoozemans >Assignee: Joost Hoozemans >Priority: Major > > When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws > an exception: > Traceback (most recent call last): > File "pyarrow/_dataset_parquet.pyx", line 165, in > pyarrow._dataset_parquet.ParquetFileFormat._finish_write > File "pyarrow/{_}dataset.pyx", line 2695, in > pyarrow._dataset.WrittenFile.{_}{_}init{_}_ > OverflowError: value too large to convert to int > Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' > The file is written succesfully though. It seems related to this issue > https://issues.apache.org/jira/browse/ARROW-16761. > I would guess the problem is the python field is an int while the C++ code > returns an int64_t > [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
[ https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-17583: --- Priority: Major (was: Minor) > [Python] File write visitor throws exception on large parquet file > -- > > Key: ARROW-17583 > URL: https://issues.apache.org/jira/browse/ARROW-17583 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Joost Hoozemans >Priority: Major > > When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws > an exception: > Traceback (most recent call last): > File "pyarrow/_dataset_parquet.pyx", line 165, in > pyarrow._dataset_parquet.ParquetFileFormat._finish_write > File "pyarrow/{_}dataset.pyx", line 2695, in > pyarrow._dataset.WrittenFile.{_}{_}init{_}_ > OverflowError: value too large to convert to int > Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' > The file is written succesfully though. It seems related to this issue > https://issues.apache.org/jira/browse/ARROW-16761. > I would guess the problem is the python field is an int while the C++ code > returns an int64_t > [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
[ https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598471#comment-17598471 ] Antoine Pitrou commented on ARROW-17583: Your diagnosis seems right. Would you want to submit a PR? > [Python] File write visitor throws exception on large parquet file > -- > > Key: ARROW-17583 > URL: https://issues.apache.org/jira/browse/ARROW-17583 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Joost Hoozemans >Priority: Minor > > When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws > an exception: > Traceback (most recent call last): > File "pyarrow/_dataset_parquet.pyx", line 165, in > pyarrow._dataset_parquet.ParquetFileFormat._finish_write > File "pyarrow/{_}dataset.pyx", line 2695, in > pyarrow._dataset.WrittenFile.{_}{_}init{_}_ > OverflowError: value too large to convert to int > Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' > The file is written succesfully though. It seems related to this issue > https://issues.apache.org/jira/browse/ARROW-16761. > I would guess the problem is the python field is an int while the C++ code > returns an int64_t > [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
[ https://issues.apache.org/jira/browse/ARROW-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joost Hoozemans updated ARROW-17583: Description: When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws an exception: Traceback (most recent call last): File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write File "pyarrow/{_}dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.{_}{_}init{_}_ OverflowError: value too large to convert to int Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' The file is written succesfully though. It seems related to this issue https://issues.apache.org/jira/browse/ARROW-16761. I would guess the problem is the python field is an int while the C++ code returns an int64_t [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] was: When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws an exception: Traceback (most recent call last): File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write File "pyarrow/_dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.__init__ OverflowError: value too large to convert to int Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' The file is written succesfully though. It seems related to this issue https://issues.apache.org/jira/browse/ARROW-16761. 
I would guess the problem is the python field is an int while the C++ code return an int64_t [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > [Python] File write visitor throws exception on large parquet file > -- > > Key: ARROW-17583 > URL: https://issues.apache.org/jira/browse/ARROW-17583 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 9.0.0 >Reporter: Joost Hoozemans >Priority: Minor > > When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws > an exception: > Traceback (most recent call last): > File "pyarrow/_dataset_parquet.pyx", line 165, in > pyarrow._dataset_parquet.ParquetFileFormat._finish_write > File "pyarrow/{_}dataset.pyx", line 2695, in > pyarrow._dataset.WrittenFile.{_}{_}init{_}_ > OverflowError: value too large to convert to int > Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' > The file is written succesfully though. It seems related to this issue > https://issues.apache.org/jira/browse/ARROW-16761. > I would guess the problem is the python field is an int while the C++ code > returns an int64_t > [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17583) [Python] File write visitor throws exception on large parquet file
Joost Hoozemans created ARROW-17583: --- Summary: [Python] File write visitor throws exception on large parquet file Key: ARROW-17583 URL: https://issues.apache.org/jira/browse/ARROW-17583 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0 Reporter: Joost Hoozemans When writing a large parquet file (e.g. 5GB) using pyarrow.dataset, it throws an exception: Traceback (most recent call last): File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write File "pyarrow/_dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.__init__ OverflowError: value too large to convert to int Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor' The file is written successfully though. It seems related to this issue https://issues.apache.org/jira/browse/ARROW-16761. I would guess the problem is the Python field is an int while the C++ code returns an int64_t [https://github.com/apache/arrow/pull/13338/files#diff-4f2eb12337651b45bab2b03abe2552dd7fc9958b1fbbeb09a2a488804b097109R164] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-17557) [Go] WASM build fails
[ https://issues.apache.org/jira/browse/ARROW-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Schaub closed ARROW-17557. -- Resolution: Fixed The wasm build works with `-tags noasm` and either github.com/apache/thrift@v0.16.0 or arrow/parquet v10. > [Go] WASM build fails > - > > Key: ARROW-17557 > URL: https://issues.apache.org/jira/browse/ARROW-17557 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Tim Schaub >Priority: Major > > I see ARROW-4689 and it looks like > [https://github.com/apache/arrow/pull/3707] was supposed to add support for > building with {{GOOS=js GOARCH=wasm}}. > When I try to build a wasm binary, I get the following failure > {code} > # GOOS=js GOARCH=wasm go build -o test.wasm ./main.go > # github.com/apache/arrow/go/v9/internal/utils > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:76:4: > undefined: TransposeInt8Int8 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:78:4: > undefined: TransposeInt8Int16 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:80:4: > undefined: TransposeInt8Int32 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:82:4: > undefined: TransposeInt8Int64 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:84:4: > undefined: TransposeInt8Uint8 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:86:4: > undefined: TransposeInt8Uint16 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:88:4: > undefined: TransposeInt8Uint32 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:90:4: > undefined: TransposeInt8Uint64 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:95:4: > undefined: TransposeInt16Int8 
> ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4: > undefined: TransposeInt16Int16 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4: > too many errors > # github.com/apache/thrift/lib/go/thrift > ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:63: > undefined: syscall.MSG_PEEK > ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:80: > undefined: syscall.MSG_DONTWAIT > {code} > {code} > go version go1.18.2 darwin/arm64 > {code} > {code} > github.com/apache/arrow/go/v9 v9.0.0 > {code} > Does additional code need to be generated for the {{wasm}} arch? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17557) [Go] WASM build fails
[ https://issues.apache.org/jira/browse/ARROW-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598465#comment-17598465 ] Tim Schaub commented on ARROW-17557: It looks like `-tags noasm` was one part of the issue. The other issue with thrift@0.15.0 was addressed in https://github.com/apache/thrift/pull/2455. I updated the arrow and parquet packages to v10, and the build works now. Thank you for the help. > [Go] WASM build fails > - > > Key: ARROW-17557 > URL: https://issues.apache.org/jira/browse/ARROW-17557 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Tim Schaub >Priority: Major > > I see ARROW-4689 and it looks like > [https://github.com/apache/arrow/pull/3707] was supposed to add support for > building with {{GOOS=js GOARCH=wasm}}. > When I try to build a wasm binary, I get the following failure > {code} > # GOOS=js GOARCH=wasm go build -o test.wasm ./main.go > # github.com/apache/arrow/go/v9/internal/utils > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:76:4: > undefined: TransposeInt8Int8 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:78:4: > undefined: TransposeInt8Int16 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:80:4: > undefined: TransposeInt8Int32 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:82:4: > undefined: TransposeInt8Int64 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:84:4: > undefined: TransposeInt8Uint8 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:86:4: > undefined: TransposeInt8Uint16 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:88:4: > undefined: TransposeInt8Uint32 > 
../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:90:4: > undefined: TransposeInt8Uint64 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:95:4: > undefined: TransposeInt16Int8 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4: > undefined: TransposeInt16Int16 > ../../go/pkg/mod/github.com/apache/arrow/go/v9@v9.0.0/internal/utils/transpose_ints_def.go:97:4: > too many errors > # github.com/apache/thrift/lib/go/thrift > ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:63: > undefined: syscall.MSG_PEEK > ../../go/pkg/mod/github.com/apache/thrift@v0.15.0/lib/go/thrift/socket_unix_conn.go:60:80: > undefined: syscall.MSG_DONTWAIT > {code} > {code} > go version go1.18.2 darwin/arm64 > {code} > {code} > github.com/apache/arrow/go/v9 v9.0.0 > {code} > Does additional code need to be generated for the {{wasm}} arch? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak
[ https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol reassigned ARROW-17573: - Assignee: Matthew Topol > [Go] Parquet ByteArray statistics cause memory leak > --- > > Key: ARROW-17573 > URL: https://issues.apache.org/jira/browse/ARROW-17573 > Project: Apache Arrow > Issue Type: Bug > Components: Go, Parquet >Affects Versions: 9.0.0 >Reporter: Sasha Sirovica >Assignee: Matthew Topol >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > When using `arrow.BinaryTypes.String` in a schema, appending multiple > strings, and then writing a record out to parquet, the memory of the program > continuously increases. This also applies for the other `arrow.BinaryTypes` > > I took a heap dump midway through the program and the majority of allocations > come from `StringBuilder.Append`, which is not GC'd. I approached 16GB of RAM > before terminating the program. > > I was not able to replicate this behavior with just PrimitiveTypes. Another > interesting point: if the records are created but never written with pqarrow, > memory does not grow. In the below program, commenting out `w.Write(rec)` will > not cause memory issues.
> Example program which causes memory to leak: > {code:java} > package main > import ( >"os" >"github.com/apache/arrow/go/v9/arrow" >"github.com/apache/arrow/go/v9/arrow/array" >"github.com/apache/arrow/go/v9/arrow/memory" >"github.com/apache/arrow/go/v9/parquet" >"github.com/apache/arrow/go/v9/parquet/compress" >"github.com/apache/arrow/go/v9/parquet/pqarrow" > ) > func main() { >f, _ := os.Create("/tmp/test.parquet") >arrowProps := pqarrow.DefaultWriterProps() >schema := arrow.NewSchema( > []arrow.Field{ > {Name: "aString", Type: arrow.BinaryTypes.String}, > }, > nil, >) >w, _ := pqarrow.NewFileWriter(schema, f, > parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), > arrowProps) >builder := array.NewRecordBuilder(memory.DefaultAllocator, schema) >for i := 1; i < 50; i++ { > builder.Field(0).(*array.StringBuilder).Append("HelloWorld!") > if i%200 == 0 { > // Write row groups out every 2M times > rec := builder.NewRecord() > w.Write(rec) > rec.Release() > } >} >w.Close() > }{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
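One generic way this kind of growth arises in Go — offered purely as an illustration; whether it is the exact mechanism behind the ByteArray statistics leak is an assumption, and the actual fix is in the linked pull request — is a statistics struct retaining subslices of large buffers. A subslice pins its entire backing array, so the GC cannot reclaim the builder memory; copying just the needed bytes breaks that reference:

```go
package main

import "fmt"

// stats mimics column statistics that keep min/max byte values.
type stats struct{ min, max []byte }

// retained keeps subslices: their backing array is the whole buffer,
// so the garbage collector cannot reclaim any of it.
func retained(buf []byte) stats {
	return stats{min: buf[:4], max: buf[len(buf)-4:]}
}

// copied duplicates only the bytes it needs, letting the buffer be freed.
func copied(buf []byte) stats {
	s := stats{min: make([]byte, 4), max: make([]byte, 4)}
	copy(s.min, buf[:4])
	copy(s.max, buf[len(buf)-4:])
	return s
}

func main() {
	big := make([]byte, 1<<20) // 1 MiB buffer standing in for builder memory
	a := retained(big)
	b := copied(big)
	fmt.Println(cap(a.min), cap(b.min)) // 1048576 4
}
```

The capacities make the difference visible: the retained min value drags the full 1 MiB backing array along, while the copied one holds only its own 4 bytes.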
[jira] [Commented] (ARROW-17569) [C++] Bump xsimd to 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598455#comment-17598455 ] Bernhard Manfred Gruber commented on ARROW-17569: - Hi! I tried to push the update of xsimd to vcpkg and noticed that with the change arrow is failing to build: https://github.com/microsoft/vcpkg/pull/26501. Good to see that this is being worked on. > [C++] Bump xsimd to 9.0.0 > - > > Key: ARROW-17569 > URL: https://issues.apache.org/jira/browse/ARROW-17569 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Serge Guelton >Assignee: Serge Guelton >Priority: Minor > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > xsmd has released a new upstream version (namely 9.0.0), it would be nice to > match it -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17582) Relax / extend type checking for pyarrow array creation
[ https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Forsyth updated ARROW-17582: Description: in [ibis|https://github.com/ibis-project/ibis] we're interested in offering query results as a record batch – some of the data we're starting with is coming back from a {{sqlalchemy.cursor}} which _look_ like {{{}tuple{}}}s and {{{}dict{}}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and {{{}sqlalchemy.engine.row.RowMapping{}}}, respectively. The checks in {{python_to_arrow.cc}} are strict enough that these can't be readily dumped into an {{array}} without first calling, e.g. {{tuple}} on the individual rows of the results. {code:java} In [168]: batch[:5] Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)] In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())]) In [170]: pa.array(batch[:5], type=pa_schema) --- ArrowTypeError Traceback (most recent call last) Input In [170], in () > 1 pa.array(batch[:5], type=pa_schema) File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status() ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938 GetKeyValuePair(items, i) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010 InferKeyKind(items) 
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73 func(value, static_cast(i), &keep_going) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182 converter->Extend(seq, size) {code} vs {code:java} In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema) Out[171]: -- is_valid: all not null -- child 0 type: int32 [ 1, 1, 1, 1, 1 ] -- child 1 type: int32 [ 2173, 943, 892, 30, 337 ]{code} To avoid the overhead of this extra conversion, maybe there are some checks that aren't explicit python type-checks that we can rely on? was: in [ibis|https://github.com/ibis-project/ibis] we're interested in offering query results as a record batch – some of the data we're starting with is coming back from a {{sqlalchemy.cursor}} which _look_ like {{{}tuple{}}}s and {{{}dict{}}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and {{{}sqlalchemy.engine.row.RowMapping{}}}, respectively. The checks in `python_to_arrow.cc` are strict enough that these can't be readily dumped into an `array` without first calling, e.g. `tuple` on the individual rows of the results. 
{code:java} In [168]: batch[:5] Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)] In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())]) In [170]: pa.array(batch[:5], type=pa_schema) --- ArrowTypeError Traceback (most recent call last) Input In [170], in () > 1 pa.array(batch[:5], type=pa_schema) File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status() ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938 GetKeyValuePair(items, i) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010 InferKeyKind(items) /build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73 func(value, static_cast(i), &keep_going) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182 converter->Extend(seq, size) {code} vs {code:java} In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema) Out[171]: -- is_valid: all not null -- child 0 type: int32 [ 1
[jira] [Updated] (ARROW-17582) Relax / extend type checking for pyarrow array creation
[ https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Forsyth updated ARROW-17582: Description: in [ibis|https://github.com/ibis-project/ibis] we're interested in offering query results as a record batch – some of the data we're starting with is coming back from a {{sqlalchemy.cursor}} which _look_ like {{{}tuple{}}}s and {{{}dict{}}}s but are actually {{sqlalchemy.engine.row.LegacyRow}} and {{{}sqlalchemy.engine.row.RowMapping{}}}, respectively. The checks in `python_to_arrow.cc` are strict enough that these can't be readily dumped into an `array` without first calling, e.g. `tuple` on the individual rows of the results. {code:java} In [168]: batch[:5] Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)] In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())]) In [170]: pa.array(batch[:5], type=pa_schema) --- ArrowTypeError Traceback (most recent call last) Input In [170], in () > 1 pa.array(batch[:5], type=pa_schema) File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status() ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938 GetKeyValuePair(items, i) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010 InferKeyKind(items) /build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73 
func(value, static_cast(i), &keep_going) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182 converter->Extend(seq, size) {code} vs {code:java} In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema) Out[171]: -- is_valid: all not null -- child 0 type: int32 [ 1, 1, 1, 1, 1 ] -- child 1 type: int32 [ 2173, 943, 892, 30, 337 ]{code} To avoid the overhead of this extra conversion, maybe there are some checks that aren't explicit python type-checks that we can rely on? was: in [ibis|https://github.com/ibis-project/ibis] we're interested in offering query results as a record batch – some of the data we're starting with is coming back from a `sqlalchemy.cursor` which _look_ like `tuple`s and `dict`s but are actually `sqlalchemy.engine.row.LegacyRow` and `sqlalchemy.engine.row.RowMapping`, respectively. The checks in `python_to_arrow.cc` are strict enough that these can't be readily dumped into an `array` without first calling, e.g. `tuple` on the individual rows of the results. 
{code:java} In [168]: batch[:5] Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)] In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())]) In [170]: pa.array(batch[:5], type=pa_schema) --- ArrowTypeError Traceback (most recent call last) Input In [170], in () > 1 pa.array(batch[:5], type=pa_schema) File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status() ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938 GetKeyValuePair(items, i) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010 InferKeyKind(items) /build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73 func(value, static_cast(i), &keep_going) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182 converter->Extend(seq, size) {code} vs {code:java} In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema) Out[171]: -- is_valid: all not null -- child 0 type: int32 [ 1, 1, 1, 1,
[jira] [Updated] (ARROW-17582) Relax / extend type checking for pyarrow array creation
[ https://issues.apache.org/jira/browse/ARROW-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Forsyth updated ARROW-17582: Description: in [ibis|https://github.com/ibis-project/ibis] we're interested in offering query results as a record batch – some of the data we're starting with is coming back from a `sqlalchemy.cursor` which _look_ like `tuple`s and `dict`s but are actually `sqlalchemy.engine.row.LegacyRow` and `sqlalchemy.engine.row.RowMapping`, respectively. The checks in `python_to_arrow.cc` are strict enough that these can't be readily dumped into an `array` without first calling, e.g. `tuple` on the individual rows of the results. {code:java} In [168]: batch[:5] Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)] In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())]) In [170]: pa.array(batch[:5], type=pa_schema) --- ArrowTypeError Traceback (most recent call last) Input In [170], in () > 1 pa.array(batch[:5], type=pa_schema) File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status() File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status() ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938 GetKeyValuePair(items, i) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010 InferKeyKind(items) /build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73 func(value, 
static_cast<int64_t>(i), &keep_going) /build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182 converter->Extend(seq, size) {code} vs {code:java} In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema) Out[171]: -- is_valid: all not null -- child 0 type: int32 [ 1, 1, 1, 1, 1 ] -- child 1 type: int32 [ 2173, 943, 892, 30, 337 ]{code} To avoid the overhead of this extra conversion, maybe there are some checks that aren't explicit python type-checks that we can rely on?
[jira] [Created] (ARROW-17582) Relax / extend type checking for pyarrow array creation
Gil Forsyth created ARROW-17582: --- Summary: Relax / extend type checking for pyarrow array creation Key: ARROW-17582 URL: https://issues.apache.org/jira/browse/ARROW-17582 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Gil Forsyth -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak
[ https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17573: --- Labels: pull-request-available (was: ) > [Go] Parquet ByteArray statistics cause memory leak > --- > > Key: ARROW-17573 > URL: https://issues.apache.org/jira/browse/ARROW-17573 > Project: Apache Arrow > Issue Type: Bug > Components: Go, Parquet >Affects Versions: 9.0.0 >Reporter: Sasha Sirovica >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When using `arrow.BinaryTypes.String` in a schema, appending multiple > strings, and then writing a record out to parquet the memory of the program > continuously increases. This also applies for the other `arrow.BinaryTypes` > > I took a heap dump midway through the program and the majority of allocations > comes from `StringBuilder.Append` which is not GC'd. I approached 16GB of RAM > before terminating the program. > > I was not able to replicate this behavior with just PrimitiveTypes. Another > interesting point, if the records are created but never written with pqarrow > memory does not grow. In the below program commenting out `w.Write(rec)` will > not cause memory issues.
> Example program which causes memory to leak: > {code:java} > package main > import ( >"os" >"github.com/apache/arrow/go/v9/arrow" >"github.com/apache/arrow/go/v9/arrow/array" >"github.com/apache/arrow/go/v9/arrow/memory" >"github.com/apache/arrow/go/v9/parquet" >"github.com/apache/arrow/go/v9/parquet/compress" >"github.com/apache/arrow/go/v9/parquet/pqarrow" > ) > func main() { >f, _ := os.Create("/tmp/test.parquet") >arrowProps := pqarrow.DefaultWriterProps() >schema := arrow.NewSchema( > []arrow.Field{ > {Name: "aString", Type: arrow.BinaryTypes.String}, > }, > nil, >) >w, _ := pqarrow.NewFileWriter(schema, f, > parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), > arrowProps) >builder := array.NewRecordBuilder(memory.DefaultAllocator, schema) >for i := 1; i <= 50_000_000; i++ { > builder.Field(0).(*array.StringBuilder).Append("HelloWorld!") > if i%2_000_000 == 0 { > // Write row groups out every 2M times > rec := builder.NewRecord() > w.Write(rec) > rec.Release() > } >} >w.Close() > }{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak
[ https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-17573: -- Component/s: Parquet -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17573) [Go] Parquet ByteArray statistics cause memory leak
[ https://issues.apache.org/jira/browse/ARROW-17573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol updated ARROW-17573: -- Summary: [Go] Parquet ByteArray statistics cause memory leak (was: [Go] String Binary Builder Leaks Memory When Writing to Parquet) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17581) [R] Refactor build_expr and eval_array_expression to remove special casing
Neal Richardson created ARROW-17581: --- Summary: [R] Refactor build_expr and eval_array_expression to remove special casing Key: ARROW-17581 URL: https://issues.apache.org/jira/browse/ARROW-17581 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 10.0.0 As [~paleolimbot] observes [here|https://github.com/apache/arrow/pull/13985#discussion_r957286453], we should avoid adding additional complexity or indirection in how expressions/bindings are defined--it's complex enough as is. We have helper functions {{build_expr}} (used with Acero, wrapper around Expression$create, returns Expression) and {{eval_array_expression}} (for eager computation on Arrays, wrapper around call_function) that wrap input arguments as Scalars or whatever, but they also do some special casing for functions that need custom handling. However, since those functions were initially written, we've developed other ways to handle these special cases more explicitly, and not all operations pass through these helper functions. We should pull out the special cases and define those functions/bindings explicitly and only use these helpers in the simple case where no extra logic is required. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16728) [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598414#comment-17598414 ] Raúl Cumplido commented on ARROW-16728: --- [~jorisvandenbossche] was your idea for this one to be done on two different releases? First - DeprecationWarning and switch default to use_legacy_dataset=False Second - Remove possibility of using use_legacy_dataset=True > [Python] Switch default and deprecate use_legacy_dataset=True in > ParquetDataset > --- > > Key: ARROW-16728 > URL: https://issues.apache.org/jira/browse/ARROW-16728 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > > The ParquetDataset() constructor itself still defaults to > {{use_legacy_dataset=True}} (although using specific attributes or keywords > related to that will raise a warning). So a next step will be to actually > deprecate passing that and switching the default, and then only afterwards we > can remove the code. -- This message was sent by Atlassian Jira (v8.20.10#820010)
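The two-release pattern asked about above can be sketched generically. Everything here is hypothetical illustration — `parquet_dataset` and the sentinel are not pyarrow's actual implementation — but it shows release one: the default flips, and explicitly passing the legacy value still works with a warning.

```python
import warnings

_DEFAULT = object()  # sentinel: detects whether the caller passed the keyword

def parquet_dataset(path, use_legacy_dataset=_DEFAULT):
    # Hypothetical wrapper for illustration only.
    if use_legacy_dataset is _DEFAULT:
        use_legacy_dataset = False  # new default for callers who pass nothing
    elif use_legacy_dataset:
        warnings.warn(
            "use_legacy_dataset=True is deprecated and will be removed "
            "in a future release",
            FutureWarning,
            stacklevel=2,
        )
    return use_legacy_dataset
```

A later release would then turn the warning into an error and delete the legacy code path.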
[jira] [Assigned] (ARROW-15691) [Dev] Update archery to work with either master or main as default branch
[ https://issues.apache.org/jira/browse/ARROW-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fiona La reassigned ARROW-15691: Assignee: Fiona La > [Dev] Update archery to work with either master or main as default branch > - > > Key: ARROW-15691 > URL: https://issues.apache.org/jira/browse/ARROW-15691 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Neal Richardson >Assignee: Fiona La >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598383#comment-17598383 ] Arthur Passos edited comment on ARROW-17459 at 8/31/22 2:31 PM: [~emkornfield] if I understand correctly, this could help with the original case I shared. In the case [~willjones127] shared, where he creates a ChunkedArray and then serializes it, it wouldn't help. Is that correct? I am stating this based on my current understanding of the inner workings of `arrow`: The ChunkedArray data structure will be used in two or more situations: 1. The data in a row group exceeds the limit of INT_MAX (Case I initially shared) 2. The serialized data/ table is a chunked array, thus it makes sense to use a chunked array. edit: I have just tested the snippet shared by Will Jones using `type = pa.map_(pa.large_string(), pa.int64())` instead of `type = pa.map_(pa.string(), pa.int32())` and the issue persists. was (Author: JIRAUSER294600): [~emkornfield] if I understand correctly, this could help with the original case I shared. In the case [~willjones127] shared, where he creates a ChunkedArray and then serializes it, it wouldn't help. Is that correct? I am stating this based on my current understanding of the inner workings of `arrow`: The ChunkedArray data structure will be used in two or more situations: 1. The data in a row group exceeds the limit of INT_MAX (Case I initially shared) 2. The serialized data/ table is a chunked array, thus it makes sense to use a chunked array. > [C++] Support nested data conversions for chunked array > --- > > Key: ARROW-17459 > URL: https://issues.apache.org/jira/browse/ARROW-17459 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Arthur Passos >Priority: Blocker > > `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not > implemented for chunked array outputs". 
It fails on > [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]) > Data schema is: > {code:java} > optional group fields_map (MAP) = 217 { > repeated group key_value { > required binary key (STRING) = 218; > optional binary value (STRING) = 219; > } > } > fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047 > fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963 > {code} > Is there a way to work around this issue in the cpp lib? > In any case, I am willing to implement this, but I need some guidance. I am > very new to parquet (as in started reading about it yesterday). > > Probably related to: https://issues.apache.org/jira/browse/ARROW-10958 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598406#comment-17598406 ] Weston Pace commented on ARROW-17541: - > Would a more precise way to say this be that there is some shared pointer > (potentially held by an R6 object that is still in scope and not being > destroyed) that is keeping the record batches from being freed? We do have an > R reference to the exec plan and the final node of the exec plan (which would > be the penultimate node in the dataset write, which is probably the scan > node). (It still makes no sense to me why the batches aren't getting > released). Yes, I think that is a more precise way. Holding onto the ExecPlan (which owns the nodes too) should be ok (indeed, desirable). > [R] Substantial RAM use increase in 9.0.0 release on write_dataset() > > > Key: ARROW-17541 > URL: https://issues.apache.org/jira/browse/ARROW-17541 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 9.0.0 >Reporter: Carl Boettiger >Priority: Critical > Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · > Plotly Chart Studio.png > > > Consider the following example of opening a remote dataset (a single 4 GB > parquet file) and streaming it to disk. Consider this reprex: > > {code:java} > s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", > anonymous=TRUE) > df <- arrow::open_dataset(s3$path("waq_test")) > arrow::write_dataset(df, tempfile()) > {code} > In 8.0.0, this operation peaks at about ~10 GB RAM use, which is already > surprisingly high (when the whole file is 4 GB when on disk), but on arrow > 9.0.0 RAM use for the same operation approximately doubles, which is large > enough to trigger the OOM killer on the task in several of our active > production workflows. > > Can this large RAM use increase introduced in 9.0 be avoided? 
Is it possible > for this operation to use even less RAM than it does in 8.0 release? Is > there something about this particular parquet file that should be responsible > for the large RAM use? > > Arrow's impressively fast performance on large data on remote hosts is really > game-changing for us. Still, the OOM errors are a bit unexpected at this > scale (i.e. single 4GB parquet file), as R users we really depend on arrow's > out-of-band operations to work with larger-than-RAM data. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-17541: Priority: Critical (was: Major) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?
[ https://issues.apache.org/jira/browse/ARROW-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598400#comment-17598400 ] Alenka Frim commented on ARROW-17579: - Oh, got it. Not sure I can help but will for sure dig into it! > [Python] PYARROW_CXXFLAGS ignored? > -- > > Key: ARROW-17579 > URL: https://issues.apache.org/jira/browse/ARROW-17579 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > > In {{setup.py}}, I see that the {{PYARROW_CXXFLAGS}} environment variable is > read, but its value then seems to be ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?
[ https://issues.apache.org/jira/browse/ARROW-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598393#comment-17598393 ] Antoine Pitrou commented on ARROW-17579: Even if I try to pass {{PYARROW_CXXFLAGS}} directly to CMake, it seems to be used for linking but not compiling, as reported in ARROW-17580. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17568) [FlightRPC][Integration] Ensure all RPC methods are covered by integration testing
[ https://issues.apache.org/jira/browse/ARROW-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598392#comment-17598392 ] David Li commented on ARROW-17568: -- https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/integration_tests/test_integration.cc and https://github.com/apache/arrow/blob/cf27001da088d882a7d460cddd84a0202f3d8eba/dev/archery/archery/integration/runner.py#L424-L438 > [FlightRPC][Integration] Ensure all RPC methods are covered by integration > testing > -- > > Key: ARROW-17568 > URL: https://issues.apache.org/jira/browse/ARROW-17568 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Go, Integration, Java >Reporter: David Li >Priority: Major > > This would help catch issues like https://github.com/apache/arrow/issues/13853 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?
[ https://issues.apache.org/jira/browse/ARROW-17579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598391#comment-17598391 ] Alenka Frim commented on ARROW-17579: - I guess so. I think {{self.cmake_cxxflags}} should be added to {{cmake_options}} in {{_run_cmake_pyarrow_cpp}} and {{_run_cmake}} so it is read by CMake when building. -- This message was sent by Atlassian Jira (v8.20.10#820010)
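A minimal sketch of that suggestion, assuming the flags are forwarded as a `-DPYARROW_CXXFLAGS` definition — the real option name and the structure of setup.py's build steps may differ:

```python
import os

def build_cmake_options(cmake_cxxflags: str) -> list:
    # Hypothetical helper: assemble the option list passed to the CMake
    # invocation, forwarding any user-supplied C++ flags.
    opts = ["-DCMAKE_BUILD_TYPE=release"]
    if cmake_cxxflags:
        opts.append(f"-DPYARROW_CXXFLAGS={cmake_cxxflags}")
    return opts

# Value parsed from the environment, as setup.py already does.
options = build_cmake_options(os.environ.get("PYARROW_CXXFLAGS", ""))
```

The point of the fix is simply that the parsed value must reach both CMake invocations (pyarrow C++ and Cython), not just be read and dropped.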
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598383#comment-17598383 ] Arthur Passos commented on ARROW-17459: --- [~emkornfield] if I understand correctly, this could help with the original case I shared. In the case [~willjones127] shared, where he creates a ChunkedArray and then serializes it, it wouldn't help. Is that correct? I am stating this based on my current understanding of the inner workings of `arrow`: The ChunkedArray data structure will be used in two or more situations: 1. The data in a row group exceeds the limit of INT_MAX (Case I initially shared) 2. The serialized data/ table is a chunked array, thus it makes sense to use a chunked array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags
[ https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598375#comment-17598375 ] Antoine Pitrou commented on ARROW-17580: Another problem seems to be that PyArrow uses two independent CMakeLists files that are not related to each other (meaning two different sets of CMake invocations, with different possible options...). > [Doc][C++][Python] Unclear how to influence compilation flags > - > > Key: ARROW-17580 > URL: https://issues.apache.org/jira/browse/ARROW-17580 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Antoine Pitrou >Priority: Critical > > Frequently people need to customize compilation flags for C++ and/or C files. > Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find > out the proper way to do this. > For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while > the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). > For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment > variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two > problems: > * it is only recognized for Cython-generated files, not for PyArrow C++ > sources > * it only affects linker calls, while it should actually affect compiler calls -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags
Antoine Pitrou created ARROW-17580: -- Summary: [Doc][C++][Python] Unclear how to influence compilation flags Key: ARROW-17580 URL: https://issues.apache.org/jira/browse/ARROW-17580 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation, Python Reporter: Antoine Pitrou Frequently people need to customize compilation flags for C++ and/or C files. Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find out the proper way to do this. For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment is ignored, and the {{PYARROW_CXXFLAGS}} has two problems: * it is only recognized for Cython-generated files, not for PyArrow C++ sources * it only affects linker calls, while it should actually affect compiler calls -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags
[ https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598372#comment-17598372 ] Antoine Pitrou commented on ARROW-17580: cc [~alenka] [~kou] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17580) [Doc][C++][Python] Unclear how to influence compilation flags
[ https://issues.apache.org/jira/browse/ARROW-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-17580: --- Description: Frequently people need to customize compilation flags for C++ and/or C files. Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find out the proper way to do this. For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two problems: * it is only recognized for Cython-generated files, not for PyArrow C++ sources * it only affects linker calls, while it should actually affect compiler calls was: Frequently people need to customize compilation flags for C++ and/or C files. Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find out the proper way to do this. For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment is ignored, and the {{PYARROW_CXXFLAGS}} has two problems: * it is only recognized for Cython-generated files, not for PyArrow C++ sources * it only affects linker calls, while it should actually affect compiler calls > [Doc][C++][Python] Unclear how to influence compilation flags > - > > Key: ARROW-17580 > URL: https://issues.apache.org/jira/browse/ARROW-17580 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Antoine Pitrou >Priority: Critical > > Frequently people need to customize compilation flags for C++ and/or C files. > Unfortunately, both for Arrow C++ and PyArrow, it is very difficult to find > out the proper way to do this. 
> For Arrow C++, it seems {{ARROW_CXXFLAGS}} should be passed to CMake, while > the {{CXXFLAGS}} environment variable is ignored (it probably shouldn't?). > For PyArrow, I have not found a way to do it. The {{CXXFLAGS}} environment > variable is ignored, and the {{PYARROW_CXXFLAGS}} CMake variable has two > problems: > * it is only recognized for Cython-generated files, not for PyArrow C++ > sources > * it only affects linker calls, while it should actually affect compiler calls -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17579) [Python] PYARROW_CXXFLAGS ignored?
Antoine Pitrou created ARROW-17579: -- Summary: [Python] PYARROW_CXXFLAGS ignored? Key: ARROW-17579 URL: https://issues.apache.org/jira/browse/ARROW-17579 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Antoine Pitrou In {{setup.py}}, I see that the {{PYARROW_CXXFLAGS}} environment variable is read, but its value then seems to be ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010)
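For illustration, the pattern {{setup.py}} would need is roughly the following. This is a minimal plain-Python sketch; the function name and forwarding logic are hypothetical, not Arrow's actual build code — the point is only that flags read from the environment must actually be threaded through to the compiler invocations.

```python
import os


def collect_cxxflags(environ=os.environ):
    """Gather extra C++ flags from the environment.

    The two variable names mirror the ones discussed in the issue;
    the splitting/forwarding logic itself is illustrative only.
    """
    flags = []
    for var in ("PYARROW_CXXFLAGS", "CXXFLAGS"):
        value = environ.get(var, "").strip()
        if value:
            flags.extend(value.split())
    return flags


# The collected flags must then be passed to every *compiler* call
# (e.g. via extra_compile_args on each Extension), not only to the
# linker -- otherwise they are read but silently dropped, which is
# the symptom this issue describes.
```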
[jira] [Assigned] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation
[ https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Smirnov reassigned ARROW-17330: -- Assignee: (was: Alexey Smirnov) > [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array > concatenation > --- > > Key: ARROW-17330 > URL: https://issues.apache.org/jira/browse/ARROW-17330 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 8.0.0 >Reporter: Alexey Smirnov >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Extend ArrowBuffer.BitmapBuilder with Append method overloaded with > ReadOnlySpan parameter. > This allows to add validity bits to the builder more efficiently (especially > for cases when initial validity bits are added to newly created empty > builder). More over it makes BitmapBuilder API more consistent (for example > ArrowBuffer.Builder does have such method). > Currently adding new bits to existing bitmap is implemented in > ArrayDataConcatenator, Code adds bit by bit in a cycle converting each to a > boolean value: > for (int i = 0; i < length; i++) > { > builder.Append(span.IsEmpty || BitUtility.GetBit(span, i)); > } > Initial problem was described in this email: > https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb > PR: https://github.com/apache/arrow/pull/13810 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation
[ https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598339#comment-17598339 ] Alexey Smirnov edited comment on ARROW-17330 at 8/31/22 11:31 AM: -- Pull request is ready. Issue requires verification and approval was (Author: JIRAUSER293436): Pull request is ready. Issue requires verification approval > [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array > concatenation > --- > > Key: ARROW-17330 > URL: https://issues.apache.org/jira/browse/ARROW-17330 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 8.0.0 >Reporter: Alexey Smirnov >Assignee: Alexey Smirnov >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Extend ArrowBuffer.BitmapBuilder with Append method overloaded with > ReadOnlySpan parameter. > This allows to add validity bits to the builder more efficiently (especially > for cases when initial validity bits are added to newly created empty > builder). More over it makes BitmapBuilder API more consistent (for example > ArrowBuffer.Builder does have such method). > Currently adding new bits to existing bitmap is implemented in > ArrayDataConcatenator, Code adds bit by bit in a cycle converting each to a > boolean value: > for (int i = 0; i < length; i++) > { > builder.Append(span.IsEmpty || BitUtility.GetBit(span, i)); > } > Initial problem was described in this email: > https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb > PR: https://github.com/apache/arrow/pull/13810 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation
[ https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Smirnov reopened ARROW-17330: Pull request is ready. Issue requires verification and approval > [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array > concatenation > --- > > Key: ARROW-17330 > URL: https://issues.apache.org/jira/browse/ARROW-17330 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 8.0.0 >Reporter: Alexey Smirnov >Assignee: Alexey Smirnov >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Extend ArrowBuffer.BitmapBuilder with Append method overloaded with > ReadOnlySpan parameter. > This allows to add validity bits to the builder more efficiently (especially > for cases when initial validity bits are added to newly created empty > builder). More over it makes BitmapBuilder API more consistent (for example > ArrowBuffer.Builder does have such method). > Currently adding new bits to existing bitmap is implemented in > ArrayDataConcatenator, Code adds bit by bit in a cycle converting each to a > boolean value: > for (int i = 0; i < length; i++) > { > builder.Append(span.IsEmpty || BitUtility.GetBit(span, i)); > } > Initial problem was described in this email: > https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb > PR: https://github.com/apache/arrow/pull/13810 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation
[ https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Smirnov resolved ARROW-17330. Resolution: Implemented > [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array > concatenation > --- > > Key: ARROW-17330 > URL: https://issues.apache.org/jira/browse/ARROW-17330 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 8.0.0 >Reporter: Alexey Smirnov >Assignee: Alexey Smirnov >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Extend ArrowBuffer.BitmapBuilder with Append method overloaded with > ReadOnlySpan parameter. > This allows to add validity bits to the builder more efficiently (especially > for cases when initial validity bits are added to newly created empty > builder). More over it makes BitmapBuilder API more consistent (for example > ArrowBuffer.Builder does have such method). > Currently adding new bits to existing bitmap is implemented in > ArrayDataConcatenator, Code adds bit by bit in a cycle converting each to a > boolean value: > for (int i = 0; i < length; i++) > { > builder.Append(span.IsEmpty || BitUtility.GetBit(span, i)); > } > Initial problem was described in this email: > https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb > PR: https://github.com/apache/arrow/pull/13810 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17330) [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array concatenation
[ https://issues.apache.org/jira/browse/ARROW-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Smirnov reassigned ARROW-17330: -- Assignee: Alexey Smirnov > [C#] Extend ArrowBuffer.BitmapBuilder to improve performance of array > concatenation > --- > > Key: ARROW-17330 > URL: https://issues.apache.org/jira/browse/ARROW-17330 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 8.0.0 >Reporter: Alexey Smirnov >Assignee: Alexey Smirnov >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Extend ArrowBuffer.BitmapBuilder with Append method overloaded with > ReadOnlySpan parameter. > This allows to add validity bits to the builder more efficiently (especially > for cases when initial validity bits are added to newly created empty > builder). More over it makes BitmapBuilder API more consistent (for example > ArrowBuffer.Builder does have such method). > Currently adding new bits to existing bitmap is implemented in > ArrayDataConcatenator, Code adds bit by bit in a cycle converting each to a > boolean value: > for (int i = 0; i < length; i++) > { > builder.Append(span.IsEmpty || BitUtility.GetBit(span, i)); > } > Initial problem was described in this email: > https://lists.apache.org/thread/kls6tjq2hclsvd16tw901ooo5soojrmb > PR: https://github.com/apache/arrow/pull/13810 -- This message was sent by Atlassian Jira (v8.20.10#820010)
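The per-bit loop quoted in the issue is what the new {{Append(ReadOnlySpan)}} overload avoids. A rough Python sketch of the idea on a little-endian packed bitmap (illustrative only — the actual change lives in C#'s {{ArrowBuffer.BitmapBuilder}}, and these helper names are invented for the sketch):

```python
def append_bit_by_bit(bitmap: bytearray, length: int, bits) -> int:
    """Append validity bits one at a time (the loop quoted above)."""
    for bit in bits:
        if bit:
            bitmap[length // 8] |= 1 << (length % 8)
        length += 1
    return length


def append_span(bitmap: bytearray, length: int, span: bytes, nbits: int) -> int:
    """Append a whole validity span at once.

    When the builder is byte-aligned (the common case of appending to a
    freshly created empty builder) the span can be copied wholesale
    instead of bit by bit.
    """
    if length % 8 == 0:
        # Aligned fast path: one byte copy, no per-bit branching.
        # (Assumes trailing bits of the span past nbits are zero.)
        bitmap[length // 8 : length // 8 + len(span)] = span
        return length + nbits
    # Unaligned fallback: same per-bit work as before.
    for i in range(nbits):
        if span[i // 8] & (1 << (i % 8)):
            bitmap[(length + i) // 8] |= 1 << ((length + i) % 8)
    return length + nbits
```

Both paths produce identical bitmaps; the aligned path simply replaces O(n) branchy bit operations with a single memory copy, which is where the concatenation speedup comes from.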
[jira] [Created] (ARROW-17578) [CI] Nightly test-r-gcc-12 fails to build
Raúl Cumplido created ARROW-17578: - Summary: [CI] Nightly test-r-gcc-12 fails to build Key: ARROW-17578 URL: https://issues.apache.org/jira/browse/ARROW-17578 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Raúl Cumplido [test-r-gcc-12|https://github.com/ursacomputing/crossbow/runs/8104457062?check_suite_focus=true] has been failing to build since the 18th of August. The current error log is: {code:java} #4 ERROR: executor failed running [/bin/bash -o pipefail -c apt-get update -y && apt-get install -y dirmngr apt-transport-https software-properties-common && wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release -cs)'-cran40/' && apt-get install -y r-base=${r}* r-recommended=${r}* libxml2-dev libgit2-dev libssl-dev clang clang-format clang-tidy texlive-latex-base locales python3 python3-pip python3-dev && locale-gen en_US.UTF-8 && apt-get clean && rm -rf /var/lib/apt/lists/*]: exit code: 100 -- > [ 2/17] RUN apt-get update -y && apt-get install -y dirmngr apt-transport-https software-properties-common && wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release -cs)'-cran40/' && apt-get install -y r-base=4.2* r-recommended=4.2* libxml2-dev libgit2-dev libssl-dev clang clang-format clang-tidy texlive-latex-base locales python3 python3-pip python3-dev && locale-gen en_US.UTF-8 && apt-get clean && rm -rf /var/lib/apt/lists/*: -- executor failed running [/bin/bash -o pipefail -c apt-get update -y && apt-get install -y dirmngr apt-transport-https software-properties-common && wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && add-apt-repository 
'deb https://cloud.r-project.org/bin/linux/ubuntu '$(lsb_release -cs)'-cran40/' && apt-get install -y r-base=${r}* r-recommended=${r}* libxml2-dev libgit2-dev libssl-dev clang clang-format clang-tidy texlive-latex-base locales python3 python3-pip python3-dev && locale-gen en_US.UTF-8 && apt-get clean && rm -rf /var/lib/apt/lists/*]: exit code: 100 Service 'ubuntu-r-only-r' failed to build : Build failed {code} I can't reproduce this locally (I get a different error), but could it be that we are pulling an outdated upstream Docker image? I don't think the changes introduced on the first failing build are relevant to this failure, but here they are: https://github.com/apache/arrow/compare/6e8f0e4d327180375dda53287a5a600ba139ce3d...a1c3d57af514d4a84e753ff51df8e563135ee55e cc [~kou] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-17483) Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'
[ https://issues.apache.org/jira/browse/ARROW-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miles Granger reassigned ARROW-17483: - Assignee: Miles Granger > Support for 'pa.compute.Expression' in filter argument to 'pa.read_table' > - > > Key: ARROW-17483 > URL: https://issues.apache.org/jira/browse/ARROW-17483 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Patrik Kjærran >Assignee: Miles Granger >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, the _filters_ argument supports {{{}List{}}}[{{{}Tuple{}}}] or > {{{}List{}}}[{{{}List{}}}[{{{}Tuple{}}}]] or None as its input types. I was > surprised to see that Expressions were not supported, considering that filters > are converted to expressions internally when using use_legacy_dataset=False. > The check on > [L150-L153|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L150-L153] > short-circuits and succeeds when encountering an expression, but later fails > on > [L2343|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L2343] > as the expression is evaluated as part of a boolean expression. > I think declaring filters using pa.compute.Expressions is more pythonic and less > error-prone, and ill-formed filters will be detected much earlier than when > using list-of-tuple-of-string equivalents. 
> *Example:* > {code:java} > import pyarrow as pa > import pyarrow.compute as pc > import pyarrow.parquet as pq > # Creating a dummy table > table = pa.table({ > 'year': [2020, 2022, 2021, 2022, 2019, 2021], > 'n_legs': [2, 2, 4, 4, 5, 100], > 'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", > "Centipede"] > }) > pq.write_to_dataset(table, root_path='dataset_name_2', > partition_cols=['year']) > # Reading using 'pyarrow.compute.Expression' > pq.read_table('dataset_name_2', columns=["n_legs", "animal"], > filters=pc.field("n_legs") < 4) > # Reading using List[Tuple] > pq.read_table('dataset_name_2', columns=["n_legs", "animal"], > filters=[('n_legs', '<', 4)]) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17483) Support for 'pa.compute.Expression' in filter argument to 'pa.read_table'
[ https://issues.apache.org/jira/browse/ARROW-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17483: --- Labels: pull-request-available (was: ) > Support for 'pa.compute.Expression' in filter argument to 'pa.read_table' > - > > Key: ARROW-17483 > URL: https://issues.apache.org/jira/browse/ARROW-17483 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Patrik Kjærran >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, the _filters_ argument supports {{{}List{}}}[{{{}Tuple{}}}] or > {{{}List{}}}[{{{}List{}}}[{{{}Tuple{}}}]] or None as its input types. I was > surprised to see that Expressions were not supported, considering that filters > are converted to expressions internally when using use_legacy_dataset=False. > The check on > [L150-L153|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L150-L153] > short-circuits and succeeds when encountering an expression, but later fails > on > [L2343|https://github.com/apache/arrow/blob/28cf3f9f769dda11ddfe52bd316c96aecb656522/python/pyarrow/parquet/core.py#L2343] > as the expression is evaluated as part of a boolean expression. > I think declaring filters using pa.compute.Expressions is more pythonic and less > error-prone, and ill-formed filters will be detected much earlier than when > using list-of-tuple-of-string equivalents. 
> *Example:* > {code:java} > import pyarrow as pa > import pyarrow.compute as pc > import pyarrow.parquet as pq > # Creating a dummy table > table = pa.table({ > 'year': [2020, 2022, 2021, 2022, 2019, 2021], > 'n_legs': [2, 2, 4, 4, 5, 100], > 'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", > "Centipede"] > }) > pq.write_to_dataset(table, root_path='dataset_name_2', > partition_cols=['year']) > # Reading using 'pyarrow.compute.Expression' > pq.read_table('dataset_name_2', columns=["n_legs", "animal"], > filters=pc.field("n_legs") < 4) > # Reading using List[Tuple] > pq.read_table('dataset_name_2', columns=["n_legs", "animal"], > filters=[('n_legs', '<', 4)]) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17571) [Benchmarks] Default build for PyArrow seems to be debug
[ https://issues.apache.org/jira/browse/ARROW-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-17571: --- Labels: pull-request-available (was: ) > [Benchmarks] Default build for PyArrow seems to be debug > > > Key: ARROW-17571 > URL: https://issues.apache.org/jira/browse/ARROW-17571 > Project: Apache Arrow > Issue Type: Bug > Components: Benchmarking, Python >Reporter: Alenka Frim >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > After a benchmark regression was identified in the [Python refactoring > PR|https://github.com/apache/arrow/pull/13311] we identified the cause is in > the build script for benchmarks. In the file _dev/conbench_envs/hooks.sh_ the > script used to build PyArrow is _ci/scripts/python_build.sh_ where the > default for PyArrow build type is set to *debug* (assuming _CMAKE_BUILD_TYPE_ > isn't defined) > See: > [https://github.com/apache/arrow/blob/74dae618ed8d6b492bf3b88e3b9b7dfd4c21e8d8/dev/conbench_envs/hooks.sh#L60-L62] > [https://github.com/apache/arrow/blob/93b63e8f3b4880927ccbd5522c967df79e926cda/ci/scripts/python_build.sh#L55] > > I think we need to change the build type to release in > _dev/conbench_envs/hooks.sh_ (_build_arrow_python()_) or maybe better to set > the variable _CMAKE_BUILD_TYPE_ to release in > _dev/conbench_envs/benchmarks.env_. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17577) [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow
Alenka Frim created ARROW-17577: --- Summary: [C++][Python] CMake cannot find Arrow/Arrow Python when building PyArrow Key: ARROW-17577 URL: https://issues.apache.org/jira/browse/ARROW-17577 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Alenka Frim When building on master yesterday the PyArrow build worked fine. Today there is an issue with CMake being unable to find packages. See: {code:java} -- Finished CMake build and install for PyArrow C++ creating /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9 -- Running cmake for PyArrow cmake -DPYTHON_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python -DPython3_EXECUTABLE=/Users/alenkafrim/repos/pyarrow-dev-9/bin/python -DPYARROW_CPP_HOME=/Users/alenkafrim/repos/arrow/python/build/dist "" -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_SUBSTRAIT=off -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=off -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PARQUET_ENCRYPTION=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_GCS=off -DPYARROW_BUILD_S3=on -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release /Users/alenkafrim/repos/arrow/python CMake Warning: Ignoring empty string ("") provided on the command line. 
-- The C compiler identification is AppleClang 13.1.6.13160021 -- The CXX compiler identification is AppleClang 13.1.6.13160021 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- System processor: arm64 -- Performing Test CXX_SUPPORTS_ARMV8_ARCH -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Success -- Arrow build warning level: PRODUCTION -- Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: RELEASE -- Generator: Unix Makefiles -- Build output directory: /Users/alenkafrim/repos/arrow/python/build/temp.macosx-12-arm64-3.9/release -- Found Python3: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python (found version "3.9.13") found components: Interpreter Development.Module NumPy -- Found Python3Alt: /Users/alenkafrim/repos/pyarrow-dev-9/bin/python CMake Error at /opt/homebrew/Cellar/cmake/3.24.1/share/cmake/Modules/CMakeFindDependencyMacro.cmake:47 (find_package): By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "Arrow", but CMake did not find one. Could not find a package configuration file provided by "Arrow" with any of the following names: ArrowConfig.cmake arrow-config.cmake Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set "Arrow_DIR" to a directory containing one of the above files. If "Arrow" provides a separate development package or SDK, be sure it has been installed. 
Call Stack (most recent call first): build/dist/lib/cmake/ArrowPython/ArrowPythonConfig.cmake:54 (find_dependency) CMakeLists.txt:240 (find_package) {code} I did a clean built on the latest master. Am I missing some variables that need to be set after [https://github.com/apache/arrow/pull/13892] ? I am calling cmake with these flags: {code:java} cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ -DCMAKE_INSTALL_LIBDIR=lib \ -DCMAKE_BUILD_TYPE=debug \ -DARROW_WITH_BZ2=ON \ -DARROW_WITH_ZLIB=ON \ -DARROW_WITH_ZSTD=ON \ -DARROW_WITH_LZ4=ON \ -DARROW_WITH_SNAPPY=ON \ -DARROW_WITH_BROTLI=ON \ -DARROW_PLASMA=OFF \ -DARROW_PARQUET=ON \ -DPARQUET_REQUIRE_ENCRYPTION=OFF \ -DARROW_PYTHON=ON \ -DARROW_FLIGHT=ON \ -DARROW_JEMALLOC=OFF \ -DARROW_S3=ON \ -DARROW_GCS=OFF \ -DARROW_BUILD_TESTS=ON \ -DARROW_DEPENDENCY_SOURCE=AUTO \ -DARROW_INSTALL_NAME_RPATH=OFF \ -DARROW_EXTRA_ERROR_CONTEXT=ON \ -GNinja \ .. popd {code} and building python with {code:java} python setup.py build_ext --inplace {code} cc [~kou] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17543) [R] %in% on an empty vector c() fails
[ https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicola Crane resolved ARROW-17543. -- Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13990 [https://github.com/apache/arrow/pull/13990] > [R] %in% on an empty vector c() fails > - > > Key: ARROW-17543 > URL: https://issues.apache.org/jira/browse/ARROW-17543 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 9.0.0 >Reporter: Egill Axfjord Fridgeirsson >Assignee: Egill Axfjord Fridgeirsson >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > When using %in% on empty vectors I'm getting an error > "Error: Cannot infer type from vector" > I'd expect this to work the same as base R where you can use %in% on empty > vectors. > The arrow::is_in compute function does accept nulls as the value_set. If I > manually create an empty array of type NULL it does work as expected. > Reprex: > {code:java} > library(dplyr) > library(arrow) > options(arrow.debug=T) > #base R > a <- c(1,2,3) > b <- c() # NULL > a %in% b > #> [1] FALSE FALSE FALSE > # arrow arrays > arrowArray <- arrow::Array$create(c(1,2,3)) > arrow::is_in(arrowArray, c()) > #> Error: Cannot infer type from vector > # define type of c() manually > arrow::is_in(arrowArray, arrow::Array$create(c(), type=arrow::null())) > #> Array > #> > #> [ > #> false, > #> false, > #> false > #> ] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17576) conda r-arrow Linux package has
[ https://issues.apache.org/jira/browse/ARROW-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hans-Martin von Gaudecker updated ARROW-17576: -- Description: I need to read parquet files in R using a conda environment. Works great on Windows, but on Linux, r-arrow comes without some core features. If this is expected, it would be great to flag it in the docs, at least for me, reading through [https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda] gives the impression that I need not worry about features (though pinning r-arrow to 8.0.1 gives a complete version lacking only lzo-support). After creating an environment based on the attached specification: {{{}(test-r-arrow) x@x/test-r-arrow$ R{}}}{{{}R version 4.1.3 (2022-03-10) – "One Push-Up"{}}} {{Copyright (C) 2022 The R Foundation for Statistical Computing}} {{Platform: x86_64-conda-linux-gnu (64-bit)}} {{R is free software and comes with ABSOLUTELY NO WARRANTY.}} {{You are welcome to redistribute it under certain conditions.}} {{{}Type 'license()' or 'licence()' for distribution details.{}}}{\{ }} {{Natural language support but running in an English locale}} {{R is a collaborative project with many contributors.}} {{Type 'contributors()' for more information and}} {{'citation()' on how to cite R or R packages in publications.}} {{Type 'demo()' for some demos, 'help()' for on-line help, or}} {{'help.start()' for an HTML browser interface to help.}} {{Type 'q()' to quit R.}} {{> library(arrow)}} {{Some features are not enabled in this build of Arrow. 
Run `arrow_info()` for more information.}} {{Attaching package: ‘arrow’}} {{The following object is masked from ‘package:utils’:}}{{ timestamp}} {{> arrow_info()}} {{Arrow package version: 9.0.0}} {{Capabilities:}} {{ }} {{dataset FALSE}} {{substrait FALSE}} {{parquet FALSE}} {{json FALSE}} {{s3 FALSE}} {{gcs FALSE}} {{utf8proc TRUE}} {{re2 TRUE}} {{snappy TRUE}} {{gzip TRUE}} {{brotli TRUE}} {{zstd TRUE}} {{lz4 TRUE}} {{lz4_frame TRUE}} {{lzo FALSE}} {{bz2 TRUE}} {{jemalloc TRUE}} {{mimalloc TRUE}} {{To reinstall with more optional capabilities enabled, see}} {{ [https://arrow.apache.org/docs/r/articles/install.html]}} {{Memory:}} {{ }} {{Allocator jemalloc}} {{Current 0 bytes}} {{{}Max 0 bytes{}}}{{{}Runtime:{}}} {{ }} {{SIMD Level avx2}} {{{}Detected SIMD Level avx2{}}}{{{}Build:{}}} {{ }} {{C++ Library Version 9.0.0}} {{C++ Compiler GNU}} {{C++ Compiler Version 10.4.0}} {{Git ID 13127e16b858dda3b8299a1e435c3c0ba5934fdc}} Creating the same environment on Windows produces: {{> arrow_info()}} {{Arrow package version: 9.0.0}} {{Capabilities:}} {{dataset TRUE}} {{substrait FALSE}} {{parquet TRUE}} {{json TRUE}} {{s3 TRUE}} {{gcs FALSE}} {{utf8proc TRUE}} {{re2 TRUE}} {{snappy TRUE}} {{gzip TRUE}} {{brotli TRUE}} {{zstd TRUE}} {{lz4 TRUE}} {{lz4_frame TRUE}} {{lzo FALSE}} {{bz2 TRUE}} {{jemalloc FALSE}} {{mimalloc TRUE}} {{Arrow options():}} {{arrow.use_threads FALSE}} {{{}Memory:{}}}{{{}Allocator mimalloc{}}} {{Current 0 bytes}} {{Max 0 bytes}} {{Runtime:}} {{SIMD Level avx2}} {{Detected SIMD Level avx2}} {{Build:}} {{C++ Library Version 9.0.0}} {{C++ Compiler MSVC}} {{C++ Compiler Version 19.16.27048.0}} was: I need to read parquet files in R using a conda environment. Works great on Windows, but on Linux, r-arrow comes without some core features. 
If this is expected, it would be great to flag it in the docs, at least for me, reading through [https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda] gives the impression that I need not worry about features (though pinning r-arrow to 8.0.1 gives a complete version lacking only lzo-support). After creating an environment based on the attached specification: {{{}(test-r-arrow) x@x/test-r-arrow$ R{}}}{{{}R version 4.1.3 (2022-03-10) – "One Push-Up"{}}} {{Copyright (C) 2022 The R Foundation for Statistical Computing}} {{Platform: x86_64-conda-linux-gnu (64-bit)}} {{R is free software and comes with ABSOLUTELY NO WARRANTY.}} {{You are welcome to redistribute it under certain conditions.}} {{{}Type 'license()' or 'licence()' for distribution details.{}}}{\{ }} {{Natural language support but running in an English locale}} {{R is a collaborative project with many contributors.}} {{Type 'contributors()' for
[jira] [Updated] (ARROW-17576) conda r-arrow Linux package has
[ https://issues.apache.org/jira/browse/ARROW-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hans-Martin von Gaudecker updated ARROW-17576:
----------------------------------------------
Description:

I need to read parquet files in R using a conda environment. This works great on Windows, but on Linux, r-arrow comes without some core features.

If this is expected, it would be great to flag it in the docs; at least for me, reading through https://arrow.apache.org/docs/r/articles/install.html#method-1a---binary-r-package-containing-libarrow-binary-via-rspmconda gave the impression that I need not worry about features (though pinning r-arrow to 8.0.1 gives a complete version lacking only lzo support).

After creating an environment based on the attached specification:

(test-r-arrow) x@x/test-r-arrow$ R

R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(arrow)
Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

> arrow_info()
Arrow package version: 9.0.0

Capabilities:
dataset    FALSE
substrait  FALSE
parquet    FALSE
json       FALSE
s3         FALSE
gcs        FALSE
utf8proc    TRUE
re2         TRUE
snappy      TRUE
gzip        TRUE
brotli      TRUE
zstd        TRUE
lz4         TRUE
lz4_frame   TRUE
lzo        FALSE
bz2         TRUE
jemalloc    TRUE
mimalloc    TRUE

To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html

Memory:
Allocator  jemalloc
Current    0 bytes
Max        0 bytes

Runtime:
SIMD Level           avx2
Detected SIMD Level  avx2

Build:
C++ Library Version   9.0.0
C++ Compiler          GNU
C++ Compiler Version  10.4.0
Git ID                13127e16b858dda3b8299a1e435c3c0ba5934fdc

Creating the same environment on Windows produces:

> arrow_info()
Arrow package version: 9.0.0

Capabilities:
dataset     TRUE
substrait  FALSE
parquet     TRUE
json        TRUE
s3          TRUE
gcs        FALSE
utf8proc    TRUE
re2         TRUE
snappy      TRUE
gzip        TRUE
brotli      TRUE
zstd        TRUE
lz4         TRUE
lz4_frame   TRUE
lzo        FALSE
bz2         TRUE
jemalloc   FALSE
mimalloc    TRUE

Arrow options():
arrow.use_threads  FALSE

Memory:
Allocator  mimalloc
Current    0 bytes
Max        0 bytes

Runtime:
SIMD Level           avx2
Detected SIMD Level  avx2

Build:
C++ Library Version   9.0.0
C++ Compiler          MSVC
C++ Compiler Version  19.16.27048.0

was: I need to read parquet files in R using a conda environment. Works great on Windows, but on Linux, r-arrow comes without some core features.
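The capability gap above can be inspected from the shell. The following is a sketch, not taken from the issue: it assumes conda is installed and that the reporter's `test-r-arrow` environment (from the attached specification) exists.

```shell
# Activate the environment from the attached spec (name per the report).
conda activate test-r-arrow

# Print the capability matrix non-interactively. On the reported Linux build,
# dataset/parquet/json/s3 come back FALSE; the same spec on Windows yields TRUE.
Rscript -e 'arrow::arrow_info()'

# The feature set is determined by the underlying C++ package, so check which
# build variant was pulled in (the reporter's env shows arrow-cpp 9.0.0 with a
# *_cpu build string from conda-forge).
conda list 'arrow'
```

Comparing the `arrow-cpp` build string between the Linux and Windows solves of the same spec is one way to narrow down whether the difference comes from the conda-forge build recipe rather than from the r-arrow package itself.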
[jira] [Created] (ARROW-17576) conda r-arrow Linux package has
Hans-Martin von Gaudecker created ARROW-17576:
----------------------------------------------
             Summary: conda r-arrow Linux package has
                 Key: ARROW-17576
                 URL: https://issues.apache.org/jira/browse/ARROW-17576
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 9.0.0
         Environment: Ubuntu 20.04
conda 4.13.0  py39hf3d152e_1  conda-forge

Environment created based on attached script:

# Name                       Version       Build                  Channel
_libgcc_mutex                0.1           conda_forge            conda-forge
_openmp_mutex                4.5           2_gnu                  conda-forge
_r-mutex                     1.0.1         anacondar_1            conda-forge
abseil-cpp                   20211102.0    h27087fc_1             conda-forge
arrow-cpp                    9.0.0         py310h893e394_0_cpu    conda-forge
aws-c-cal                    0.5.11        h95a6274_0             conda-forge
aws-c-common                 0.6.2         h7f98852_0             conda-forge
aws-c-event-stream           0.2.7         h3541f99_13            conda-forge
aws-c-io                     0.10.5        hfb6a706_0             conda-forge
aws-checksums                0.1.11        ha31a3da_7             conda-forge
aws-sdk-cpp                  1.8.186       hb4091e7_3             conda-forge
binutils_impl_linux-64       2.36.1        h193b22a_2             conda-forge
bwidget                      1.9.14        ha770c72_1             conda-forge
bzip2                        1.0.8         h7f98852_4             conda-forge
c-ares                       1.18.1        h7f98852_0             conda-forge
ca-certificates              2022.6.15     ha878542_0             conda-forge
cairo                        1.16.0        ha61ee94_1013          conda-forge
curl                         7.83.1        h7bff187_0             conda-forge
expat                        2.4.8         h27087fc_0             conda-forge
font-ttf-dejavu-sans-mono    2.37          hab24e00_0             conda-forge
font-ttf-inconsolata         3.000         h77eed37_0             conda-forge
font-ttf-source-code-pro     2.038         h77eed37_0             conda-forge
font-ttf-ubuntu              0.83          hab24e00_0             conda-forge
fontconfig                   2.14.0        h8e229c2_0             conda-forge
fonts-conda-ecosystem        1             0                      conda-forge
fonts-conda-forge            1             0                      conda-forge
freetype                     2.12.1        hca18f0e_0             conda-forge
fribidi                      1.0.10        h36c2ea0_0             conda-forge
gcc_impl_linux-64            12.1.0        hea43390_16            conda-forge
gettext                      0.19.8.1      h73d1719_1008          conda-forge
gflags                       2.2.2         he1b5a44_1004          conda-forge
gfortran_impl_linux-64       12.1.0        h1db8e46_16            conda-forge
glog                         0.6.0         h6f12383_0             conda-forge
graphite2                    1.3.13        h58526e2_1001          conda-forge
grpc-cpp                     1.46.3        hbd84cd8_3             conda-forge
gsl                          2.7           he838d99_0             conda-forge
gxx_impl_linux-64            12.1.0        hea43390_16            conda-forge
harfbuzz                     5.1.0         hf9f4e7c_0             conda-forge
icu                          70.1          h27087fc_0             conda-forge
jpeg                         9e            h166bdaf_2             conda-forge
kernel-headers_linux-64      2.6.32        he073ed8_15            conda-forge
keyutils                     1.6.1         h166bdaf_0             conda-forge
krb5                         1.19.3        h3790be6_0             conda-forge
ld_impl_linux-64             2.36.1        hea4e1c9_2             conda-forge
lerc                         4.0.0         h27087fc_0             conda-forge
libblas                      3.9.0         16_linux64_openblas    conda-forge
libbrotlicommon              1.0.9         h166bdaf_7             conda-forge
libbrotlidec                 1.0.9         h166bdaf_7             conda-forge
libbrotlienc                 1.0.9         h166bdaf_7             conda-forge
libcblas                     3.9.0         16_linux64_openblas    conda-forge
libcrc32c                    1.1.2         h9c3ff4c_0             conda-forge
libcurl                      7.83.1        h7bff187_0             conda-forge
libdeflate                   1.13          h166bdaf_0             conda-forge
libedit                      3.1.20191231  he28a2e2_2             conda-forge
libev                        4.33          h516909a_1             conda-forge
libevent                     2.1.10        h9b69904_4             conda-forge
libffi                       3.4.2         h7f98852_5             conda-forge
libgcc-devel_linux-64        12.1.0        h1ec3361_16            conda-forge
libgcc-ng                    12.1.0        h8d9b700_16            conda-forge
libgfortran-ng               12.1.0        h69a702a_16            conda-forge
libgfortran
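The description notes that pinning r-arrow to 8.0.1 yields a complete Linux build lacking only lzo support. A hedged sketch of that workaround follows; the environment name and the `r-base` pin are illustrative assumptions, not from the report.

```shell
# Workaround sketch: pin r-arrow to 8.0.1 on conda-forge, which per the report
# produces a Linux build missing only lzo support.
conda create -y -n r-arrow-pinned -c conda-forge r-arrow=8.0.1 r-base=4.1

# Verify the resulting capability matrix before relying on it.
conda run -n r-arrow-pinned Rscript -e 'arrow::arrow_info()'
```

If parquet and dataset show TRUE here but FALSE with 9.0.0 from the same channel, that points at a regression in the 9.0.0 conda-forge build rather than at the local setup.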
[jira] [Commented] (ARROW-17568) [FlightRPC][Integration] Ensure all RPC methods are covered by integration testing
[ https://issues.apache.org/jira/browse/ARROW-17568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598210#comment-17598210 ] Kun Liu commented on ARROW-17568:
---------------------------------
Hi [~lidavidm], do we have a framework for testing the RPC methods and their compatibility in the integration tests?

> [FlightRPC][Integration] Ensure all RPC methods are covered by integration
> testing
> --
>
> Key: ARROW-17568
> URL: https://issues.apache.org/jira/browse/ARROW-17568
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, FlightRPC, Go, Integration, Java
> Reporter: David Li
> Priority: Major
>
> This would help catch issues like https://github.com/apache/arrow/issues/13853

--
This message was sent by Atlassian Jira
(v8.20.10#820010)