Hi,
I am using Pig 0.14 to work on partitioned ORC files, and I am trying to
improve the performance of my Pig script. However, I am curious why
applying a filter at the beginning (approach 1) does not help and even
takes longer than a replicated join (approach 2). This filter is supposed
to cut down a lot of the data read from the ORC files. Is this related to
how I partition the ORC files? Any guidelines/suggestions are appreciated.
---------------
Approach 1
---------------
coordinate = LOAD 'coordinate' USING
org.apache.hive.hcatalog.pig.HCatLoader();
coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;
rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
rawdata_u_1 = foreach rawdata_u generate
date,hh,(double)xlong_u,(double)xlat_u,height,u,zone,year,month;
u_filter = FILTER rawdata_u_1 by zone == 2;
/**** Here I filter early and expect better performance, but I do not
get it ****/
u_filter = FILTER u_filter by xlong_u == coordinate_xy.xlong_u and
xlat_u == coordinate_xy.xlat_u;
---------------
Approach 2
---------------
coordinate = LOAD 'coordinate' USING
org.apache.hive.hcatalog.pig.HCatLoader();
coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;
rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
u_filter = FILTER rawdata_u by zone == 2;
join_u_coordinate_cossin = join u_filter by (xlong_u, xlat_u),
coordinate_xy by (xlong_u, xlat_u) USING 'replicated';
Best,
Patcharee