Hello all, I am new to Apache Pig and would like to know about its performance benchmarks.
My current requirement is as follows. I have a few files in two S3 buckets, and each file may have a minimum of 1 million records. The file data is tab-separated. I have to compare a few columns and filter the records. Right now I am using Hive, and it is taking more than 2 days to filter the records. Please find the Hive query below:

    INSERT OVERWRITE TABLE cnv_algo3
    SELECT * FROM table1 t1 JOIN table2 t2
    WHERE unix_timestamp(t2.time, 'yyyy-MM-dd HH:mm:ss,SSS') > unix_timestamp(t1.time, 'yyyy-MM-dd HH:mm:ss,SSS')
      and compare(t1.column1, t1.column2, t2.column1, t2.column4);

Here compare is a UDF. Assume table1 has 20 million records and table2 has 5 million records.

Please let me know how long Pig would take to filter the records in a standard configuration. We urgently need to decide whether to move the project to Pig, so any help is highly appreciated.

Thanks and Regards,
Malligarjunan S.
