Re: Apache PIG performance benchmark

Suraj Nayak Sat, 12 Jul 2014 14:11:26 -0700

Hi Malligarjunan,

Pig and Hive, unless you are running them on Tez, compile your script or SQL
into one or more MapReduce jobs that are launched on the cluster, which is
how you get parallel processing. But if you use s3, you cannot use the core
principle of Hadoop, i.e. data locality. The data has to travel to the
process, which takes more time, and only 1 machine ends up processing it, so
it turns into a sequential read. That's why it is taking 2 days or more to
process the data.

I think the choice between Pig and Hive depends on the use case. If your
logic involves many processing pipelines, many transformations and custom
algorithms, Pig will be the choice. Hive will be a better choice if the
problem you are solving can be answered using SQL-like statements.

So I suggest you reconsider and use HDFS instead of s3, as I can see your
query involves a join and your input data set is relatively large.
Otherwise you'll end up running 1 process for a very long time.

It also depends on the cluster size and machine configuration.
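
As a rough sketch of why the join size and the cluster size both matter: the
quoted query has no ON clause, so assuming Hive treats it as a cross join,
every row of table1 is paired with every row of table2 before the WHERE
filter runs. The row counts below come from the question; the function names
and the throughput figure (pairs_per_second_per_task) are hypothetical
placeholders for illustration, not measurements.

```python
# Back-of-the-envelope estimate for the quoted join.
# Row counts are taken from the original question.
TABLE1_ROWS = 20_000_000
TABLE2_ROWS = 5_000_000


def candidate_pairs(left_rows, right_rows):
    """Row pairs a join with no key condition has to evaluate."""
    return left_rows * right_rows


def estimated_days(pairs, pairs_per_second_per_task, parallel_tasks):
    """Wall-clock days, assuming the work splits evenly across tasks."""
    seconds = pairs / (pairs_per_second_per_task * parallel_tasks)
    return seconds / 86_400  # seconds per day


pairs = candidate_pairs(TABLE1_ROWS, TABLE2_ROWS)
print(f"candidate pairs: {pairs:.1e}")

# ASSUMPTION: 1,000,000 pairs/second/task is a made-up rate; measure your own.
for tasks in (1, 10, 100):
    days = estimated_days(pairs, 1_000_000, tasks)
    print(f"{tasks:>3} parallel tasks: about {days:,.1f} days")
```

Whatever the real rate turns out to be, the estimate scales with the product
of the two row counts and inversely with the number of tasks that actually
run in parallel, which is why a job stuck on 1 process runs for so long.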

--
Suraj Nayak
On 11-Jul-2014 10:52 PM, "S Malligarjunan" <[email protected]>
wrote:

> Hello All,
>
> I am a newbie to Apache PIG and would like to know the performance
> benchmark of Apache PIG.
>
> My current requirement is as follows:
> I have a few files in 2 s3 buckets.
> Each file may have a minimum of 1 million records. The file data is tab
> separated.
> I have to compare a few columns and filter the records.
>
> Right now I am using Hive, and it is taking more than 2 days to filter the
> records.
> Please find the Hive query below:
>
> INSERT OVERWRITE TABLE cnv_algo3
> SELECT * FROM table1 t1 JOIN table2 t2
> WHERE unix_timestamp(t2.time, 'yyyy-MM-dd HH:mm:ss,SSS') >
>       unix_timestamp(t1.time, 'yyyy-MM-dd HH:mm:ss,SSS')
>   AND compare(t1.column1, t1.column2, t2.column1, t2.column4);
>
> Here compare is a UDF.
> Assume table1 has 20 million records and table2 has 5 million records.
> Let me know how much time PIG will take to filter the records in a
> standard configuration.
>
> It is pretty urgent to take a decision on moving the project to PIG,
> so please help me. I highly appreciate your help.
>
>
> Thanks and Regards,
> Malligarjunan S.
>
