[jira] [Commented] (ARROW-10438) [C++][Dataset] Partitioning::Format on nulls

Weston Pace (Jira) Tue, 19 Jan 2021 18:51:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268333#comment-17268333
 ]


Weston Pace commented on ARROW-10438:
-------------------------------------

I've created a test for this.  Currently the code bails out with an error 
(Grouping on a field with nulls).  It appears the default behavior in Hive 
(https://github.com/apache/hive/blob/1d5e6bdff99cd5aa7b885f001635d7231c3b9d44/common/src/java/org/apache/hadoop/hive/common/FileUtils.java#L271)
is to use the string "__HIVE_DEFAULT_PARTITION__" as the value.  Googling 
around for this value suggests that this is how it is used in practice.
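
As a rough illustration of the Hive convention described above, here is a 
minimal standalone C++ sketch.  FormatNullableSegment and 
kHiveDefaultPartition are hypothetical names, not part of Arrow's 
Partitioning API; only the fallback string itself comes from the Hive source 
linked above.  The sketch also skips any path escaping a real writer would 
need.

```cpp
#include <optional>
#include <string>

// Hypothetical constant name; the value is the string quoted from Hive.
const std::string kHiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__";

// Hypothetical helper (not an Arrow API): format one "key=value" path
// segment. A null value, modeled here as std::nullopt, is written with
// the Hive fallback string instead of failing.
std::string FormatNullableSegment(const std::string& key,
                                  const std::optional<std::string>& value) {
  return key + "=" + (value.has_value() ? *value : kHiveDefaultPartition);
}
```

With this sketch, a null "year" key would be written under the directory 
"year=__HIVE_DEFAULT_PARTITION__" rather than producing an error.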

Furthermore, in Hive, empty strings also map to this value, so empty strings 
and nulls end up in the same partition.  I'm assuming we want compatibility 
with Hive in this way?  Impala did things slightly differently to avoid the 
ambiguity (https://issues.apache.org/jira/browse/IMPALA-252) by rejecting 
data that contains empty strings with an error.  However, this sort of 
strictness doesn't seem quite in keeping with Arrow.
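
To make the two options concrete, here is a small standalone C++ sketch of 
both policies.  Everything named here (EmptyStringPolicy, 
FormatSegmentWithPolicy, kHiveDefaultPartition) is hypothetical and only for 
illustration; it is not Arrow, Hive, or Impala code, and it throws an 
exception where Arrow itself would more likely return a Status.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical constant name; the value is the string quoted from Hive.
const std::string kHiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__";

// Hypothetical policy switch for how empty strings are handled.
enum class EmptyStringPolicy {
  kHiveCompatible,  // empty string and null share the fallback partition
  kRejectEmpty      // empty string is an error (the stricter, Impala-like choice)
};

// Hypothetical helper: format one "key=value" path segment under a policy.
std::string FormatSegmentWithPolicy(const std::string& key,
                                    const std::optional<std::string>& value,
                                    EmptyStringPolicy policy) {
  if (!value.has_value()) {
    // Null always maps to the fallback partition.
    return key + "=" + kHiveDefaultPartition;
  }
  if (value->empty()) {
    if (policy == EmptyStringPolicy::kRejectEmpty) {
      throw std::invalid_argument("empty string partition value for key '" +
                                  key + "'");
    }
    // Hive-compatible: indistinguishable from null once written.
    return key + "=" + kHiveDefaultPartition;
  }
  return key + "=" + *value;
}
```

Under the Hive-compatible policy the written path no longer records whether 
the original value was null or an empty string, which is the ambiguity the 
stricter policy avoids.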

> [C++][Dataset] Partitioning::Format on nulls
> --------------------------------------------
>
>                 Key: ARROW-10438
>                 URL: https://issues.apache.org/jira/browse/ARROW-10438
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Writing a dataset with null partition keys is currently untested. Ensure the 
> behavior is documented and correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
