kolfild26 commented on issue #44513:
URL: https://github.com/apache/arrow/issues/44513#issuecomment-2544141432
@zanmato1984
Stacktrace:
```bash
Dec 16 01:07:44 kernel: python[37938]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f10b09018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37971]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f002b0018 error 4
Dec 16 01:07:44 kernel: python[37961]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f052d0018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37957]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f072d8018 error 4
Dec 16 01:07:44 kernel: python[37940]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f0fb07018 error 4
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37974]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3d18f6d018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37966]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f02abf018 error 4
Dec 16 01:07:44 kernel: python[37951]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f0a2ec018 error 4
Dec 16 01:07:44 kernel: python[37973]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3efb7fe018 error 4
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37953]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f092e6018 error 4
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 abrt-hook-ccpp: Process 35963 (python3.10) of user 1000 killed by SIGSEGV - dumping core
```
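
To make the log above easier to act on, the offset of the faulting instruction inside `libarrow.so.1801` can be computed by subtracting the mapping base from the instruction pointer. The snippet below is only a minimal sketch: `parse_segfault` and `SEGFAULT_RE` are hypothetical names, and it assumes the usual Linux x86-64 segfault line format, where the error code is hexadecimal and its low three bits mean page present, write access, and user mode.

```python
import re

# Assumed format of a Linux x86-64 kernel segfault line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: decode one segfault line, or return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "lib": m["lib"],
        # Offset of the faulting instruction relative to the start of the mapping.
        "offset": hex(int(m["ip"], 16) - int(m["base"], 16)),
        "page_present": bool(err & 1),  # bit 0: 0 means the page was not present
        "write": bool(err & 2),         # bit 1: 0 means the access was a read
        "user_mode": bool(err & 4),     # bit 2: 1 means the fault came from user mode
    }


line = (
    "Dec 16 01:07:44 kernel: python[37938]: segfault at 7f3004626050 ip "
    "00007f3fc25441cd sp 00007f3f10b09018 error 4 in "
    "libarrow.so.1801[7f3fc1670000+2269000]"
)
info = parse_segfault(line)
print(info)
```

If debug symbols for the library are available, and assuming the mapping shown starts at the library's load base, the resulting offset could be passed to a tool such as `addr2line -e libarrow.so.1801 <offset>` to locate the source line.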
Here are the tables' statistics:
<details>
<summary>Script to get stats</summary>
```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.types as patypes


def get_column_distributions(table):
    """Collect per-column null, cardinality, and basic value statistics."""
    distributions = {}
    total_rows = table.num_rows

    for column in table.schema.names:
        col_data = table[column]

        null_count = pc.sum(pc.is_null(col_data)).as_py()
        null_percentage = (null_count / total_rows) * 100 if total_rows > 0 else 0

        # Compute the cardinality (unique count / total count)
        unique_count = pc.count_distinct(col_data.filter(pc.is_valid(col_data))).as_py()
        cardinality_percentage = (
            round((unique_count / total_rows) * 100, 3) if total_rows > 0 else 0
        )

        if patypes.is_integer(col_data.type) or patypes.is_floating(col_data.type):
            stats = {
                "count": pc.count(col_data).as_py(),
                "nulls": null_count,
                "null_percentage": null_percentage,
                "cardinality_percentage": cardinality_percentage,
                "min": pc.min(col_data).as_py(),
                "max": pc.max(col_data).as_py(),
            }
        elif patypes.is_string(col_data.type) or patypes.is_binary(col_data.type):
            value_counts = pc.value_counts(col_data.filter(pc.is_valid(col_data)))
            stats = {
                "nulls": null_count,
                "null_percentage": null_percentage,
                "cardinality_percentage": cardinality_percentage,
                # value_counts is a StructArray; to_pylist() yields one
                # {"values": ..., "counts": ...} dict per distinct value.
                "value_counts": value_counts.to_pylist(),
            }
        else:
            stats = {
                "nulls": null_count,
                "null_percentage": null_percentage,
                "cardinality_percentage": cardinality_percentage,
                "message": f"Statistics not supported for type: {col_data.type}",
            }

        distributions[column] = stats

    return distributions
```
</details>
<details>
<summary>small</summary>

</details>
<details>
<summary>large</summary>

</details>
Would it be easier if I attached the tables here?