[ https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheolsoo Park updated PIG-4241:
-------------------------------
    Status: Patch Available  (was: Open)

> Auto local mode mistakenly converts large jobs to local mode when using with Hive tables
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-4241
>                 URL: https://issues.apache.org/jira/browse/PIG-4241
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.15.0
>
>         Attachments: PIG-4241-1.patch
>
>
> The current implementation of auto local mode has two severe problems:
> # It assumes file-based inputs, and it always converts jobs with
> non-file-based inputs into local mode unless
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns a value greater
> than 100M. This is particularly problematic when using Pig with Hive tables
> through custom LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the input files to compute the total size. The algorithm
> first computes the total size and then compares it against the configured
> max bytes. This is very time-consuming when a Pig job loads a large number
> of files, because every file is listed only to compute a sum. Instead, we
> should stop summing the input sizes as soon as the running total reaches
> the max bytes:
> {code:title=JobControlCompiler.java}
> long totalInputFileSize = InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS BAD!
> long inputByteMax = conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize + " bytes. Small job threshold: " + inputByteMax);
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
>     return false;
> }
> {code}
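
As a rough illustration of the early-exit idea described in the second problem, here is a minimal sketch in plain Java. It is not the attached patch and does not use any Pig or Hadoop APIs: the class name BoundedInputSize, the method isSmallInput, and the iterator of file sizes are all hypothetical stand-ins for whatever lazily lists the input files. Treating a negative size as "unknown, so not small" is an assumption, loosely mirroring the totalInputFileSize < 0 check in the snippet above.

```java
import java.util.Iterator;

// Hypothetical helper, not part of Pig: decides whether the inputs are small
// enough for local mode without summing every file size.
final class BoundedInputSize {

    private BoundedInputSize() {
    }

    /**
     * Returns true only if the sum of all sizes is at most maxBytes.
     * Stops pulling from the iterator as soon as the running total would
     * exceed maxBytes, so the remaining files are never visited.
     * Assumes maxBytes is non-negative.
     */
    static boolean isSmallInput(Iterator<Long> fileSizes, long maxBytes) {
        long total = 0L;
        while (fileSizes.hasNext()) {
            long size = fileSizes.next();
            if (size < 0) {
                // Assumption: a negative size means "unknown", so not small.
                return false;
            }
            // Written as a subtraction so that total + size cannot overflow.
            if (size > maxBytes - total) {
                return false;
            }
            total += size;
        }
        return true;
    }
}
```

If the iterator is backed by a lazy file listing, a job with many large inputs would be rejected after the first few files instead of after a full listing. Whether the real patch takes this shape is not stated here.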