[ https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420213#comment-17420213 ]
QP Hou edited comment on ARROW-14122 at 9/26/21, 5:58 AM:
----------------------------------------------------------

[~westonpace] my dev list thread proposed making the interval type partially ordered in Arrow compute, which is what I am working on at the moment for the new Rust implementation: https://github.com/jorgecarleitao/arrow2/pull/398. I proposed that because I am trying to make the compute behavior compatible with the type semantics defined in the Arrow spec. It would be odd for the spec to say that a day can have between 23 and 25 hours due to daylight saving while compute always assumes 24 hours.

However, I found that partial-order semantics are not easy for users to understand because of various edge cases. For example, should we consider "1 days 22 hours" greater than "2 days -22 hours"? "1 days 23 hours" is not comparable to "2 days" because 2 days could have 50 hours, but should we consider "1 days 50 hours" greater than "2 days"?

I also like the Joda time approach you mentioned in https://lists.apache.org/thread.html/rb7c2f111c4fb07ca7a0182f5608cf1380e6daabc05846e8503c1a7c3%40%3Cdev.arrow.apache.org%3E. Making the interval type totally unordered and requiring users to combine it with a timestamp for ordering makes everything really easy to understand.

For datafusion, we will go with postgres's approach because it aims to be postgres compatible. This is not a problem for the datafusion SQL interface because we never said the SQL types map one to one to Arrow types. In other words, Arrow interval type semantics are an implementation detail hidden from users. The consequence of postgres's behavior is that we won't be able to simply hash interval values by their physical bytes. We will need to normalize them first, i.e. "1 days 24 hours" and "2 days" should result in the same hash key in the hash aggregate and hash join compute kernels.

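To make the two options above concrete, here is a minimal Python sketch. It is an illustration only, not code from arrow2, datafusion, or the C++ kernels. The `Interval` tuple and the functions `hour_bounds`, `partial_cmp`, and `normalized_key` are hypothetical names. The partial order assumes one possible reading of the spec, namely that each day independently lasts between 23 and 25 hours, and the normalized key assumes the postgres-style rule that a day always counts as 24 hours.

```python
from typing import NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    """Hypothetical day/hour interval used for illustration; not an Arrow type."""
    days: int
    hours: int


def hour_bounds(iv: Interval) -> Tuple[int, int]:
    """Smallest and largest number of hours the interval could span,
    assuming every day lasts between 23 and 25 hours."""
    lo, hi = sorted((23 * iv.days, 25 * iv.days))  # sorted() handles negative days
    return lo + iv.hours, hi + iv.hours


def partial_cmp(a: Interval, b: Interval) -> Optional[int]:
    """Return -1, 0 or 1 when the order is certain, or None when the
    possible hour ranges overlap and the intervals are not comparable."""
    if a == b:
        return 0
    a_lo, a_hi = hour_bounds(a)
    b_lo, b_hi = hour_bounds(b)
    if a_lo > b_hi:
        return 1
    if a_hi < b_lo:
        return -1
    return None


def normalized_key(iv: Interval) -> int:
    """Total-order key under the assumption that a day is exactly 24 hours.
    Intervals with equal keys should hash and compare as equal."""
    return iv.days * 24 + iv.hours


# The examples from the comment, under the assumptions stated above.
print(partial_cmp(Interval(1, 22), Interval(2, -22)))  # 1 (certainly greater)
print(partial_cmp(Interval(1, 23), Interval(2, 0)))    # None (not comparable)
print(partial_cmp(Interval(1, 50), Interval(2, 0)))    # 1 (certainly greater)
print(normalized_key(Interval(1, 24)) == normalized_key(Interval(2, 0)))  # True
```

Under this reading, "1 days 22 hours" spans 45 to 47 hours while "2 days -22 hours" spans 24 to 28 hours, so the first is greater. Other definitions of the partial order are possible and may answer the edge cases differently.
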
Regardless of which way we go, I think it would be good for all Arrow compute implementations to behave consistently. I am not familiar with the C++ code base, so please correct me if I am wrong.

[~cpcloud] I believe https://github.com/apache/arrow/pull/10960 focuses on computing the interval from two timestamps, not on ordering between intervals?

> [C++] interval comparison kernels
> ---------------------------------
>
>                 Key: ARROW-14122
>                 URL: https://issues.apache.org/jira/browse/ARROW-14122
>             Project: Apache Arrow
>          Issue Type: Sub-task
>            Reporter: Phillip Cloud
>            Priority: Major
>              Labels: kernel
>
> Subtask for tracking interval comparison kernels

--
This message was sent by Atlassian Jira
(v8.3.4#803005)