[ https://issues.apache.org/jira/browse/ARROW-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268333#comment-17268333 ]
Weston Pace commented on ARROW-10438:
-------------------------------------

I've created a test for this. Currently the code bails out ("Grouping on a field with nulls").

It appears the default behavior in Hive (https://github.com/apache/hive/blob/1d5e6bdff99cd5aa7b885f001635d7231c3b9d44/common/src/java/org/apache/hadoop/hive/common/FileUtils.java#L271) is to use the string "__HIVE_DEFAULT_PARTITION__" as the value. Searching for this value confirms that this is how it is used in practice.

Furthermore, in Hive, empty strings also map to this value, so empty strings and nulls end up in the same partition. I assume we want to be compatible with Hive in this respect?

Impala did things slightly differently to avoid the ambiguity (https://issues.apache.org/jira/browse/IMPALA-252): it rejects data that contains empty strings with an error. However, that sort of strictness doesn't seem quite in keeping with Arrow.

> [C++][Dataset] Partitioning::Format on nulls
> --------------------------------------------
>
>                 Key: ARROW-10438
>                 URL: https://issues.apache.org/jira/browse/ARROW-10438
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Writing a dataset with null partition keys is currently untested. Ensure the
> behavior is documented and correct.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
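
For illustration only, the Hive-style mapping described in the comment above could be sketched roughly as follows. This is a minimal sketch, not Arrow's or Hive's actual implementation: the helper name `format_partition_segment` is hypothetical, and the escaping of special characters in path names is left out.

```python
# Hypothetical helper, not part of Arrow or Hive; it only illustrates the mapping.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def format_partition_segment(key, value):
    """Return a Hive-style 'key=value' directory segment for one partition key."""
    # Assumption taken from the comment: both null and the empty string map to
    # the default partition name, so they cannot be told apart after writing.
    if value is None or value == "":
        return f"{key}={HIVE_DEFAULT_PARTITION}"
    return f"{key}={value}"


if __name__ == "__main__":
    for v in (None, "", "2021"):
        print(repr(v), "->", format_partition_segment("year", v))
```

Under this sketch, a null key and an empty-string key produce the same directory name, which is exactly the ambiguity that Impala avoids by rejecting empty strings.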