pig performance on reading/filtering orc file

patcharee Fri, 29 May 2015 01:36:07 -0700

Hi,

I am using Pig 0.14 to work on a partitioned ORC file, and I am trying to improve my Pig performance. However, I am curious why filtering at the beginning (approach 1) does not help and takes even longer than the replicated join (approach 2). This filter is supposed to cut down a lot of the data read from the ORC file. Is this related to how I partition the ORC file? Any guidelines/suggestions are appreciated.


---------------
Approach 1
---------------

coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader();

coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();

rawdata_u_1 = foreach rawdata_u generate date, hh, (double)xlong_u, (double)xlat_u, height, u, zone, year, month;

u_filter = FILTER rawdata_u_1 by zone == 2;

/**** HERE I try to filter and expect to get better performance, but it does not ****/
u_filter = FILTER u_filter by xlong_u == coordinate_xy.xlong_u and xlat_u == coordinate_xy.xlat_u;


---------------
Approach 2
---------------

coordinate = LOAD 'coordinate' USING org.apache.hive.hcatalog.pig.HCatLoader();

coordinate_zone = FILTER coordinate BY zone == 2;
....
coordinate_xy = LIMIT coordinate_zone 1;

rawdata_u = LOAD 'u' USING org.apache.hive.hcatalog.pig.HCatLoader();
u_filter = FILTER rawdata_u by zone == 2;

join_u_coordinate_cossin = join u_filter by (xlong_u, xlat_u), coordinate_xy by (xlong_u, xlat_u) USING 'replicated';



Best,
Patcharee
