1. Firstly we create a hive table by loading following csv file > $cat data.csv > > ID,City,Zip,Flag > 1,A,95126,0 > 2,A,95126,1 > 3,A,95126,1 > 4,B,95124,0 > 5,B,95124,1 > 6,C,95124,0 > 7,C,95127,1 > 8,C,95127,0 > 9,C,95127,1
(a) where "ID" above is a primary key (unique), (b) for each "City" and "Zip" combination, there is one ID in max with Flag=0; while it can contain multiple IDs with Flag=1 for each "City" and "Zip" combination. (c) Flag can be 0 or 1 2. Load the csv file above to a hive table > create table test(ID string, City String, Zip String, Flag int) > ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ; > > LOAD DATA LOCAL INPATH "./data.csv" OVERWRITE INTO TABLE test; 3. For each ID with Flag=0, we want to pair it with another ID with Flag=1 but with the same City - Zip. If one cannot find another paired ID with Flag=1 and matched City - Zip, we just delete that record. Here is the Expected result: > ID,City,Zip,Flag > 1,A,95126,0 > 2,A,95126,1 > 4,B,95124,0 > 5,B,95124,1 > 7,C,95127,1 > 8,C,95127,0 Any valuable tips how to do this pairing in Hive or Python? Great thanks! Rex
