[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899605#action_12899605 ]
Yan Zhou commented on PIG-1518:
-------------------------------

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes is as follows:

Query:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';

It creates 3 map/reduce jobs. The rows below list the jobs in order.

No Split Combination:

||Job||Metric||Mappers||Reducers||
|1|number|120|300|
|1|elapsed time|24s|2m43s|
|2|number|301|300|
|2|elapsed time|46s|3m11s|
|3|number|300|40|
|3|elapsed time|38s|53s|
|Total elapsed time|7m36s|

With Split Combination:

||Job||Metric||Mappers||Reducers||
|1|number|120|300|
|1|elapsed time|22s|2m49s|
|2|number|3|300|
|2|elapsed time|27s|2m46s|
|3|number|1|40|
|3|elapsed time|17s|24s|
|Total elapsed time|7m5s|

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
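
As a rough illustration of the idea in the issue description (several small files handled by a single split, and therefore a single map), here is a minimal sketch of greedy packing by size. It is not Pig's implementation: the function name combine_splits, the first-fit strategy, and the 128 MB cap are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

MB = 1024 * 1024


def combine_splits(file_sizes: Dict[str, int], max_split_bytes: int) -> List[List[str]]:
    """Pack files into combined splits of at most max_split_bytes each.

    Hypothetical helper for illustration only. Files are placed largest
    first into the first split that still has room; a file larger than
    the cap ends up alone in its own split.
    """
    splits: List[Tuple[int, List[str]]] = []  # (bytes used, file names)
    ordered = sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        for i, (used, names) in enumerate(splits):
            if used + size <= max_split_bytes:
                splits[i] = (used + size, names + [name])
                break
        else:
            # No existing split has room, so start a new one.
            splits.append((size, [name]))
    return [names for _, names in splits]


if __name__ == "__main__":
    # 300 small files of 1 MB each (made-up sizes), capped at 128 MB per split.
    small_files = {f"part-{i:05d}": MB for i in range(300)}
    combined = combine_splits(small_files, 128 * MB)
    print(len(small_files), "files ->", len(combined), "combined splits")
```

With one map per file, the made-up input above would need 300 maps; packed under the assumed cap it needs 3. A real input format would also have to take block locations and record boundaries into account, which this sketch ignores.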