Leveraging Pig effectively

Sarath Sat, 07 Apr 2012 02:39:39 -0700

Dear All,

I have 2 data dumps (comma separated), each with around 53,000 records (this is just sample data; it could be 10 times more than this in a real environment).

I need to write a script to:

 1. find matching records from these 2 dumps based on a set of matching fields
 2. store the matching records from each dump into a database
 3. find the remaining records from each dump
 4. find matching records again, excluding one of the matching fields
 5. again store the matching records from each dump into the database

For step 1 I used "cogroup"

For step 3 I split the "cogroup" output on nulls for dumps 2 & 1 respectively, to get the remaining records for dumps 1 & 2.

For steps 2 & 5 I used the DBStorage UDF to store the records into the DB. With this approach I get 4 store commands (2 commands for each dump, at steps 2 & 5).
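For readers following along, the matching logic of steps 1, 3 and 4 can be sketched in plain Python. This only illustrates the cogroup-then-split idea; it is not the Pig script itself, and the field names (id, name, city), the sample records and the helper names are all hypothetical.

```python
from collections import defaultdict

def cogroup(left, right, key_fields):
    """Group two record lists by the same key, similar in spirit to COGROUP.

    Returns {key: (records_from_left, records_from_right)}.
    """
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[tuple(rec[f] for f in key_fields)][0].append(rec)
    for rec in right:
        groups[tuple(rec[f] for f in key_fields)][1].append(rec)
    return groups

def split_matches(left, right, key_fields):
    """Split cogrouped records into matched and remaining, per dump."""
    matched1, matched2, rest1, rest2 = [], [], [], []
    for bag1, bag2 in cogroup(left, right, key_fields).values():
        if bag1 and bag2:
            # Key present in both dumps: these are the matching records.
            matched1.extend(bag1)
            matched2.extend(bag2)
        else:
            # One side is empty: keep the records as "remaining".
            rest1.extend(bag1)
            rest2.extend(bag2)
    return matched1, matched2, rest1, rest2

# Hypothetical sample data standing in for the two dumps.
dump1 = [
    {"id": "1", "name": "ann", "city": "pune"},
    {"id": "2", "name": "bob", "city": "delhi"},
    {"id": "3", "name": "cid", "city": "goa"},
]
dump2 = [
    {"id": "1", "name": "ann", "city": "pune"},
    {"id": "2", "name": "bob", "city": "agra"},
    {"id": "9", "name": "dev", "city": "goa"},
]

# Steps 1 and 3: match on all fields, keep what is left over.
m1, m2, rest1, rest2 = split_matches(dump1, dump2, ["id", "name", "city"])

# Step 4: match the leftovers again with one field (city) excluded.
m1b, m2b, rest1b, rest2b = split_matches(rest1, rest2, ["id", "name"])
```

The second pass runs only on the leftovers of the first, so each record is stored at most once across steps 2 and 5.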

Before storing to the DB, I'm using another UDF to generate a running sequence number, which is stored as the key for each record.


Problem:
=====

The script for this entire process creates 6 map-reduce jobs and takes about 10 minutes to complete on a cluster of 3 machines (1 master and 2 slaves). The same requirement, when implemented as a stored procedure, completes in 5 minutes. Now I'm worried that my script will not cope in a real environment.


Please suggest:
-> What am I missing?
-> What more can I do to improve the performance so that it is comparable to the stored procedure?
-> What changes and/or additions are needed so that the script scales to any amount of data?


Thanks in advance,
Regards,
Sarath.
