Hi,

I am using Pig 0.14 to work on partitioned ORC files, and I am trying to improve the performance of my Pig script. I am curious why filtering at the beginning (approach 1) does not help and even takes longer than a replicated join (approach 2). The filter is supposed to cut down a lot of the data read from the ORC files. Is this related to how I partitioned the ORC files? Any guidelines or suggestions are appreciated.

---------------
Approach 1
---------------
coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader();
coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
rawdata_u_1 = foreach rawdata_u generate date,hh,(double)xlong_u,(double)xlat_u,height,u,zone,year,month;
u_filter = FILTER rawdata_u_1 by zone == 2;

/**** Here I filter on the coordinate and expect better performance, but it does not help ****/
u_filter = FILTER u_filter by xlong_u == coordinate_xy.xlong_u and xlat_u == coordinate_xy.xlat_u;

---------------
Approach 2
---------------
coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader();
coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
u_filter = FILTER rawdata_u by zone == 2;
join_u_coordinate_cossin = join u_filter by (xlong_u, xlat_u), coordinate_xy by (xlong_u, xlat_u) USING 'replicated';


Best,
Patcharee
