As you noticed your partition by clause is improper since your row_number()
counts all of the members that are the same -- instead of sequencing a run
of the same values.
Here's a POC using a simplified version of your dataset:
You can adjust the grouping windows and change detection accordingly.
CREATE TABLE foo
(
dt string,
flag int
);
insert into foo
values
('2017-07-12 22:59:17',1),
('2017-07-12 23:02:14',0),
('2017-07-12 23:07:40',0),
('2017-07-12 23:12:41',0),
('2017-07-12 23:22:06',0),
('2017-07-12 23:38:35',0),
('2017-07-12 23:44:19',0),
('2017-07-12 23:47:49',1),
('2017-07-12 23:48:49',1),
('2017-07-12 23:53:31',0),
('2017-07-12 23:57:01',1),
('2017-07-13 00:03:10',1),
('2017-07-13 00:06:35',0),
('2017-07-13 00:07:29',1),
('2017-07-12 12:59:17',1),
('2017-07-12 13:02:14',0),
('2017-07-12 13:07:40',0),
('2017-07-12 13:12:41',0);
with cte
AS
(
select
dt,
flag,
lag(flag) over(order by dt) AS prev_flag,
(case
when flag <> lag(flag,1) over(order by dt) then 1
else 0
end) as changed_indicator
from
foo as x
),
cte2
AS
(
select
*,
sum(changed_indicator) over(order by dt) AS grouping_key
from
cte
)
select
*,
row_number() over(partition by grouping_key order by dt) as seq
from
cte2
order by
dt;
+---------------------+------+-----------+-------------------+--------------+-----+
| dt | flag | prev_flag | changed_indicator | grouping_key
| seq |
+---------------------+------+-----------+-------------------+--------------+-----+
| 2017-07-12 12:59:17 | 1 | NULL | 0 | 0
| 1 |
| 2017-07-12 13:02:14 | 0 | 1 | 1 | 1
| 1 |
| 2017-07-12 13:07:40 | 0 | 0 | 0 | 1
| 2 |
| 2017-07-12 13:12:41 | 0 | 0 | 0 | 1
| 3 |
| 2017-07-12 22:59:17 | 1 | 0 | 1 | 2
| 1 |
| 2017-07-12 23:02:14 | 0 | 1 | 1 | 3
| 1 |
| 2017-07-12 23:07:40 | 0 | 0 | 0 | 3
| 2 |
| 2017-07-12 23:12:41 | 0 | 0 | 0 | 3
| 3 |
| 2017-07-12 23:22:06 | 0 | 0 | 0 | 3
| 4 |
| 2017-07-12 23:38:35 | 0 | 0 | 0 | 3
| 5 |
| 2017-07-12 23:44:19 | 0 | 0 | 0 | 3
| 6 |
| 2017-07-12 23:47:49 | 1 | 0 | 1 | 4
| 1 |
| 2017-07-12 23:48:49 | 1 | 1 | 0 | 4
| 2 |
| 2017-07-12 23:53:31 | 0 | 1 | 1 | 5
| 1 |
| 2017-07-12 23:57:01 | 1 | 0 | 1 | 6
| 1 |
| 2017-07-13 00:03:10 | 1 | 1 | 0 | 6
| 2 |
| 2017-07-13 00:06:35 | 0 | 1 | 1 | 7
| 1 |
| 2017-07-13 00:07:29 | 1 | 0 | 1 | 8
| 1 |
+---------------------+------+-----------+-------------------+--------------+-----+
On Tue, Jul 17, 2018 at 11:08 PM Anup Tiwari <[email protected]> wrote:
> Can someone look into this and revert if possible?
>
>
> Regards,
> Anup Tiwari
>
> On Sat, Jul 14, 2018 at 12:28 AM, Anup Tiwari <[email protected]>
> wrote:
>
>> Hi All,
>>
>> Can someone look into this and revert if possible?
>>
>> Thanks.
>>
>>
>> On Thu, 12 Jul 2018 12:56 Anup Tiwari, <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> We have a use case where we want to assign a row number to a table based
>>> on 3 column ( uid, update_date, flag) i.e. if value of any of the
>>> column gets changed, we want to reset this number. Please find below sample
>>> input data and expected output data.
>>>
>>> Also please note that we have tried row_number() over(partition by uid,
>>> update_date, flag order by update_time asc) but due to this actual
>>> input ordering got break due to I believe partition by clause because it
>>> seems partition by creates group within column specified and then it start
>>> row number and due to this actual ordering is breaking. So i just wanted to
>>> know that is there any function available in hive which can give us below
>>> result OR we are missing something in window function?
>>>
>>>
>>> *Input Data :- *
>>>
>>> *uid* *update_date* *update_time* *flag*
>>> 468730 2017-07-12 12/07/2017 22:59:17 1
>>> 468730 2017-07-12 12/07/2017 23:02:14 0
>>> 468730 2017-07-12 12/07/2017 23:07:40 0
>>> 468730 2017-07-12 12/07/2017 23:12:41 0
>>> 468730 2017-07-12 12/07/2017 23:22:06 0
>>> 468730 2017-07-12 12/07/2017 23:38:35 0
>>> 468730 2017-07-12 12/07/2017 23:44:19 0
>>> 468730 2017-07-12 12/07/2017 23:47:49 1
>>> 468730 2017-07-12 12/07/2017 23:48:49 1
>>> 468730 2017-07-12 12/07/2017 23:53:31 0
>>> 468730 2017-07-12 12/07/2017 23:57:01 1
>>> 468730 2017-07-13 13/07/2017 00:03:10 1
>>> 468730 2017-07-13 13/07/2017 00:06:35 0
>>> 468730 2017-07-13 13/07/2017 00:07:29 1
>>> 468731 2017-07-13 12/07/2017 12:59:17 1
>>> 468731 2017-07-13 12/07/2017 13:02:14 0
>>> 468731 2017-07-13 12/07/2017 13:07:40 0
>>> 468731 2017-07-13 12/07/2017 13:12:41 0
>>>
>>>
>>> *Output Data :-*
>>>
>>> *uid* *update_date* *update_time* *flag* *required_row_num*
>>> 468730 2017-07-12 12/07/2017 22:59:17 1 1
>>> 468730 2017-07-12 12/07/2017 23:02:14 0 1
>>> 468730 2017-07-12 12/07/2017 23:07:40 0 2
>>> 468730 2017-07-12 12/07/2017 23:12:41 0 3
>>> 468730 2017-07-12 12/07/2017 23:22:06 0 4
>>> 468730 2017-07-12 12/07/2017 23:38:35 0 5
>>> 468730 2017-07-12 12/07/2017 23:44:19 0 6
>>> 468730 2017-07-12 12/07/2017 23:47:49 1 1
>>> 468730 2017-07-12 12/07/2017 23:48:49 1 2
>>> 468730 2017-07-12 12/07/2017 23:53:31 0 1
>>> 468730 2017-07-12 12/07/2017 23:57:01 1 1
>>> 468730 2017-07-13 13/07/2017 00:03:10 1 1
>>> 468730 2017-07-13 13/07/2017 00:06:35 0 1
>>> 468730 2017-07-13 13/07/2017 00:07:29 1 1
>>> 468731 2017-07-13 12/07/2017 12:59:17 1 1
>>> 468731 2017-07-13 12/07/2017 13:02:14 0 1
>>> 468731 2017-07-13 12/07/2017 13:07:40 0 2
>>> 468731 2017-07-13 12/07/2017 13:12:41 0 3
>>> *FYI :* We are one Hive 2.3.1.
>>>
>>
>