Hi Malligarjunan,

Pig or Hive, if you are not using Tez, converts the statements or SQL into one or more MapReduce jobs that are launched on the cluster, which is how you get parallel processing. But if you use S3, you lose one of the core principles of Hadoop, i.e. data locality. The data has to be brought to the process over the network, which takes more time, and it looks as if effectively only one machine is processing the data, so it turns out to be a sequential read. That is most likely why it is taking 2 days or more to process the data.
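
As a rough illustration of why this takes so long: the query quoted below joins the two tables without an equality key, so as far as I can tell every row of table1 has to be checked against every row of table2. A minimal Python sketch of that arithmetic follows. The function names and the comparisons-per-second rate are my own assumptions for illustration, not measurements from your cluster; only the 20 million and 5 million row counts come from your mail.

```python
# Back-of-the-envelope estimate. The rate used below is an assumed,
# illustrative figure, not something measured on a real cluster.


def candidate_pairs(rows_left: int, rows_right: int) -> int:
    """Row pairs a join must examine when it has no equality key."""
    return rows_left * rows_right


def estimated_days(pairs: int, pairs_per_second: float, parallel_tasks: int = 1) -> float:
    """Wall-clock days, assuming the work is spread perfectly evenly across tasks."""
    seconds = pairs / (pairs_per_second * parallel_tasks)
    return seconds / 86_400  # seconds in a day


# Table sizes taken from the original question.
pairs = candidate_pairs(20_000_000, 5_000_000)
print(f"{pairs:.1e} pairs to compare")

for tasks in (1, 10, 100):
    # 1e7 comparisons per second per task is a hypothetical rate.
    days = estimated_days(pairs, 1e7, tasks)
    print(f"{tasks} task(s): about {days:.1f} days")
```

If you plug in a rate that you actually measure for the compare UDF, this should show how much the number of tasks working in parallel matters for a job of this size.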

I think the choice between Pig and Hive depends on the use case. If your logic involves many processing pipelines, transformations and custom algorithms, Pig is the better choice. Hive is the better choice if the problem you are solving can be answered with SQL-like statements. Either way, I suggest you reconsider and use HDFS instead of S3, since your query involves a join and your input data set is relatively large. Otherwise you will end up running one process for a very long time. It also depends on the cluster size and the machine configuration.

--
Suraj Nayak

On 11-Jul-2014 10:52 PM, "S Malligarjunan" <[email protected]> wrote:

> Hello All,
>
> I am a newbie to Apache PIG, I would like to know the performance
> benchmark of Apache PIG.
>
> My current requirement is as follows
> I have few files in 2 s3 buckets
> Each file may have minimum of 1 million records. File data are tab
> separated.
> Have to compare few columns and filter the records.
>
> Right now I am using Hive, it is taking more than 2 days to filter the
> records.
> Please find the hive query below
>
> INSERT OVERWRITE TABLE cnv_algo3
> SELECT * FROM table1 t1 JOIN table2 t2
>
> WHERE unix_timestamp(t2.time, 'yyyy-MM-dd HH:mm:ss,SSS') >
> unix_timestamp(t1.time, 'yyyy-MM-dd HH:mm:ss,SSS')
> and compare(t1.column1, t1.column2, t2.column1, t2.column4);
>
> Here compare is the UDF function.
> Assume table1 1 has 20 million records and table2 has 5 million records.
> Let me know how much time PIG will to take filter the records in a
> standard configuration.
>
> It is pretty urgent to take an decision to move the project to use PIG.
> Hence help me. I highly appreciate your help.
>
>
> Thanks and Regards,
> Malligarjunan S.
>
