I am comparing the runtime of the same logic implemented in Pig and as a plain MapReduce job. The logic is exactly the same, but surprisingly the MapReduce job I submit is about 100x slower. For Pig I use a UDF, and for Hadoop I use a mapper-only job with the same logic as the Pig UDF. Even the splits shown on the admin page are the same. I am not sure why it is so slow. I am submitting the job like this:
java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/

How should I go about finding the root cause of why it is so slow? Any suggestions would be really appreciated. One thing I noticed is that in the map task list on the admin page, the status for my job is "hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728", but for Pig the status is blank.