Top of head select *from (Select ID, flag, lead(id) over(partition by city,zip order by flag,ID) c from t) Where id==0 and c is not null
Should do it. Basically you want to keep records which has ID 0 and have a corresponding 1. Please let me know if doesn't work, so I can provide a right solution. On 26 Aug 2016 11:00, "Rex X" <dnsr...@gmail.com> wrote: > 1. Given following CSV file > > > $cat data.csv > > > > ID,City,Zip,Flag > > 1,A,95126,0 > > 2,A,95126,1 > > 3,A,95126,1 > > 4,B,95124,0 > > 5,B,95124,1 > > 6,C,95124,0 > > 7,C,95127,1 > > 8,C,95127,0 > > 9,C,95127,1 > > > (a) where "ID" above is a primary key (unique), > > (b) for each "City" and "Zip" combination, there is one ID in max with > Flag=0; while it can contain multiple IDs with Flag=1 for each "City" and > "Zip" combination. > > (c) Flag can be 0 or 1 > > > 2. For each ID with Flag=0, we want to pair it with another ID with > Flag=1 but with the same City - Zip. If one cannot find another paired ID > with Flag=1 and matched City - Zip, we just delete that record. > > Here is the expected result: > > > ID,City,Zip,Flag > > 1,A,95126,0 > > 2,A,95126,1 > > 4,B,95124,0 > > 5,B,95124,1 > > 7,C,95127,1 > > 8,C,95127,0 > > > Any valuable tips how to do this pairing in Python or Scala? > > Great thanks! > > Rex >