[ 
https://issues.apache.org/jira/browse/ARROW-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605601#comment-17605601
 ] 

Philipp Moritz commented on ARROW-17719:
----------------------------------------

So I was digging through the code today and noticed that handling for this case 
is already implemented: the error goes away if I set 
fragments = kInspectAllFragments in

[https://github.com/apache/arrow/blob/f8661e032902a963b0a6a46077d72e804d22560d/cpp/src/arrow/dataset/discovery.h#L60]
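
To illustrate why the number of inspected fragments matters, here is a toy 
model in plain Python. It is only a sketch and not the Arrow API: the names 
inspect, unify and INSPECT_ALL are made up for illustration, and types are 
plain strings. It assumes that a column containing only nulls is reported 
with a "null" type, and that a "null" type can be merged with any other type.

```python
from typing import Dict, List

INSPECT_ALL = -1  # hypothetical stand-in for kInspectAllFragments

Schema = Dict[str, str]  # column name -> type name (toy representation)


def unify(schemas: List[Schema]) -> Schema:
    """Merge schemas; a "null" column takes the type seen in another fragment."""
    merged: Schema = {}
    for schema in schemas:
        for name, dtype in schema.items():
            seen = merged.get(name, "null")
            if seen == "null":
                merged[name] = dtype
            elif dtype != "null" and dtype != seen:
                raise TypeError(f"column {name!r}: {seen} vs {dtype}")
    return merged


def inspect(fragments: List[Schema], n: int = 1) -> Schema:
    """Infer a dataset schema from the first n fragments (all if n is INSPECT_ALL)."""
    chosen = fragments if n == INSPECT_ALL else fragments[:n]
    return unify(chosen)


# The first partition has only nulls in "x", the second has real integers.
fragments = [{"x": "null"}, {"x": "int64"}]
default_schema = inspect(fragments)            # only the first fragment is seen
full_schema = inspect(fragments, INSPECT_ALL)  # every fragment is seen
```

In this model the default leaves "x" typed as "null", which no longer matches 
the second partition, while inspecting every fragment resolves it to "int64".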

The comments in that file make an efficiency argument for defaulting this 
parameter to 1. I wonder if we should change the default to 
kInspectAllFragments. My understanding is that the schema can be read quickly 
by seeking to the end of each file, so the performance impact should be 
minimal, and correctness is more important than performance here. If this is a 
bottleneck for somebody, they can set the parameter back to 1 or specify an 
explicit schema.
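
As a rough sketch of what "seeking to the end of the file" means: a Parquet 
file ends with a 4-byte little-endian footer length followed by the magic 
bytes PAR1, so a reader can locate the metadata (which holds the schema) 
without scanning the data pages. The helper name read_footer_length is made 
up, and the input below is a synthetic byte buffer, not a real Parquet file. 
Decoding the Thrift-encoded metadata itself is left out.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer_length(f) -> int:
    """Return the size in bytes of the Parquet footer metadata."""
    f.seek(-8, io.SEEK_END)  # last 8 bytes: 4-byte length + 4-byte magic
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (length,) = struct.unpack("<I", tail[:4])
    return length


# Synthetic stand-in for a file: magic, fake data, 5 footer bytes, length, magic.
fake = io.BytesIO(MAGIC + b"data" + b"\x00" * 5 + struct.pack("<I", 5) + MAGIC)
footer_len = read_footer_length(fake)
```

Even so, inspecting every fragment still costs at least one read per file, 
which is where the latency concern on slow file systems comes from.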

Any thoughts about this?

 

> [Python] Improve error message when all values in a column are null in a 
> parquet partition
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17719
>                 URL: https://issues.apache.org/jira/browse/ARROW-17719
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Philipp Moritz
>            Priority: Minor
>             Fix For: 10.0.0
>
>
> There is a good bug report about this in 
> [https://stackoverflow.com/a/70568419/10891801] and it still seems to be a 
> problem.
> Basically the error message is pretty bad if all values in a given column of 
> a parquet partition are null. We should either handle this case better or 
> give a better error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
