Daniel Dai created PIG-4679:
-------------------------------

             Summary: Performance degradation due to InputSizeReducerEstimator since PIG-3754
                 Key: PIG-4679
                 URL: https://issues.apache.org/jira/browse/PIG-4679
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Daniel Dai
            Assignee: Daniel Dai
             Fix For: 0.16.0

On encountering a non-HDFS location in the input (for example, a JOIN involving both HBase tables and intermediate temp files), the Pig 0.14 ReducerEstimator returns a total input size of -1 (unknown), whereas Pig 0.12.1 returned the sum of the temp file sizes as the total size. Because -1 is returned as the input size, Pig ends up using only one reducer for the job.

STEPS TO REPRODUCE:

1. Create an HBase table with enough data. Use the PerformanceEvaluation tool to generate the data:
{code:java}
hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 --rows=1000000 sequentialWrite 10
{code}

2. Dump the table data into a file that can then be used in a Pig JOIN. The following Pig script generates the data file:
{code:java}
$ pig
A = LOAD 'hbase://TestTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS (row_key: chararray, data: chararray);
STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
{code}

3. Check that the file size exceeds 1,000,000,000, which is the default bytes-per-reducer setting in Pig:
{code:java}
$ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
QA:   1 41 10280000000 hdfs:///tmp/re_test/test_table_data
PROD: 1 57 10280000000 hdfs:///tmp/re_test/test_table_data
{code}

4. Run a Pig script that joins the HBase table with the data file. QA and PROD will use different numbers of reducers: QA (176243) should run 1 reducer and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000, rounded up):
{code:java}
$ pig
A = LOAD 'hbase://TestTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS (row_key: chararray, data: chararray);
B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS (row_key: chararray, data: chararray);
C = JOIN A BY row_key, B BY row_key;
STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
{code}

Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.

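The difference in reducer counts can be illustrated with a small sketch of the arithmetic. This is not Pig's actual code: the function names and the `skip_unknown` flag are hypothetical, and the only assumptions taken from the report are the 1,000,000,000 bytes-per-reducer default, the -1 "unknown" marker, and the fallback to a single reducer when the size is unknown. An input whose size cannot be determined (the HBase table here) is modelled as `None`.

```python
import math

BYTES_PER_REDUCER = 1_000_000_000  # default bytes per reducer, as stated in the report
UNKNOWN = -1                       # marker for "total input size unknown"


def total_input_size(sizes, skip_unknown):
    """Sum input sizes in bytes; None marks an input whose size is unknown.

    skip_unknown=True  models the 0.12.1 behaviour described in the report
                       (unknown inputs are ignored, the rest are summed).
    skip_unknown=False models the later behaviour (any unknown input makes
                       the whole total unknown).
    """
    total = 0
    for size in sizes:
        if size is None:
            if skip_unknown:
                continue
            return UNKNOWN
        total += size
    return total


def estimate_reducers(total, bytes_per_reducer=BYTES_PER_REDUCER):
    """Return ceil(total / bytes_per_reducer), or 1 when the total is unknown."""
    if total == UNKNOWN:
        return 1
    return max(1, math.ceil(total / bytes_per_reducer))


# HBase table (size unknown) joined with the 10,280,000,000-byte HDFS file.
inputs = [None, 10_280_000_000]
old_estimate = estimate_reducers(total_input_size(inputs, skip_unknown=True))
new_estimate = estimate_reducers(total_input_size(inputs, skip_unknown=False))
print(old_estimate, new_estimate)
```

With the sizes from step 3, the first estimate corresponds to the 11 reducers seen on Pig 0.12.1 and the second to the single reducer seen on Pig 0.13+.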
-- This message was sent by Atlassian JIRA (v6.3.4#6332)