Mostafa Mokhtar created HIVE-8292:
-------------------------------------
Summary: Reading from partitioned bucketed tables has high
overhead in MapOperator.cleanUpInputFileChangedOp
Key: HIVE-8292
URL: https://issues.apache.org/jira/browse/HIVE-8292
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.14.0
Environment: cn105
Reporter: Mostafa Mokhtar
Assignee: Owen O'Malley
Fix For: 0.14.0
Reading from bucketed, partitioned tables has significantly higher overhead
than reading from non-bucketed, non-partitioned files.
50% of the time is spent in these two lines of code in
OrcInputFormat.getReader():
{code}
String txnString = conf.get(ValidTxnList.VALID_TXNS_KEY,
Long.MAX_VALUE + ":");
ValidTxnList validTxnList = new ValidTxnListImpl(txnString);
{code}
{code}
Stack Trace  Sample Count  Percentage(%)
hive.ql.exec.tez.MapRecordSource.pushRecord()  2,981  87.215
org.apache.tez.mapreduce.lib.MRReaderMapred.next()  2,002  58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(Object, Object)  2,002  58.572
mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader()  1,984  58.046
hive.ql.io.HiveInputFormat.getRecordReader(InputSplit, JobConf, Reporter)  1,983  58.016
hive.ql.io.orc.OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter)  1,891  55.325
hive.ql.io.orc.OrcInputFormat.getReader(InputSplit, AcidInputFormat$Options)  1,723  50.41
hive.common.ValidTxnListImpl.<init>(String)  934  27.326
conf.Configuration.get(String, String)  621  18.169
{code}
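To illustrate why these two lines are costly when executed once per record reader, here is a minimal sketch of parsing the transaction string once and reusing the result. It is not the fix applied in Hive: ParsedTxnList and TxnListCache are hypothetical stand-ins, and the "<highWatermark>:<txn>:..." string format is an assumption made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a parsed transaction list. This is NOT Hive's
// ValidTxnListImpl; the "<highWatermark>:<txn>:<txn>..." format is an assumption.
final class ParsedTxnList {
    // Counts how many times a string was actually parsed.
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    final long highWatermark;
    final long[] exceptions;

    ParsedTxnList(String txnString) {
        PARSE_COUNT.incrementAndGet();
        String[] parts = txnString.split(":");
        highWatermark = Long.parseLong(parts[0]);
        exceptions = new long[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            exceptions[i - 1] = Long.parseLong(parts[i]);
        }
    }
}

// Hypothetical cache: each distinct transaction string is parsed only once,
// no matter how many readers ask for it.
final class TxnListCache {
    private static final ConcurrentHashMap<String, ParsedTxnList> CACHE =
        new ConcurrentHashMap<>();

    static ParsedTxnList get(String txnString) {
        return CACHE.computeIfAbsent(txnString, ParsedTxnList::new);
    }
}
```

With a cache like this, opening many small bucket files within one task would pay the parsing cost once instead of once per file, assuming the transaction string does not change during the task.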
Another 20% of the profile is spent in MapOperator.cleanUpInputFileChangedOp:
5% of the CPU in
{code}
Path onepath = normalizePath(onefile);
{code}
and 15% of the CPU in
{code}
onepath.toUri().relativize(fpath.toUri()).equals(fpath.toUri());
{code}
From the profiler:
{code}
Stack Trace  Sample Count  Percentage(%)
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(Object)  978  28.613
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(Writable)  978  28.613
org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged()  866  25.336
org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp()  866  25.336
java.net.URI.relativize(URI)  655  19.163
java.net.URI.relativize(URI, URI)  655  19.163
java.net.URI.normalize(String)  517  15.126
java.net.URI.needsNormalization(String)  372  10.884
java.lang.String.charAt(int)  235  6.875
java.net.URI.equal(String, String)  27  0.79
java.lang.StringBuilder.toString()  1  0.029
java.lang.StringBuilder.<init>()  1  0.029
java.lang.StringBuilder.append(String)  1  0.029
org.apache.hadoop.hive.ql.exec.MapOperator.normalizePath(String)  167  4.886
org.apache.hadoop.fs.Path.<init>(String)  162  4.74
org.apache.hadoop.fs.Path.initialize(String, String, String, String)  162  4.74
org.apache.hadoop.fs.Path.normalizePath(String, String)  97  2.838
org.apache.commons.lang.StringUtils.replace(String, String, String)  97  2.838
org.apache.commons.lang.StringUtils.replace(String, String, String, int)  97  2.838
java.lang.String.indexOf(String, int)  97  2.838
java.net.URI.<init>(String, String, String, String, String)  65  1.902
{code}
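For context, the relativize check quoted above relies on java.net.URI.relativize returning its argument unchanged when the base path is not a prefix of it, so the expression is true when fpath is not under onepath. The sketch below shows that behavior with plain java.net.URI and one possible way to avoid repeating the normalization for the same pair of paths. PathPrefixCheck, notUnder, notUnderCached and the hdfs://nn/... paths in the usage note are hypothetical names for illustration, not Hive code or the change made for this issue.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the check; not part of Hive.
final class PathPrefixCheck {
    // Same test as the quoted expression: true when relativize() hands back
    // fpath unchanged, i.e. fpath is NOT located under onepath.
    static boolean notUnder(URI onepath, URI fpath) {
        return onepath.relativize(fpath).equals(fpath);
    }

    // Memo of results keyed by the two path strings (not thread-safe).
    private final Map<String, Boolean> memo = new HashMap<>();

    // Number of times the URI-based check actually ran.
    int uriCalls = 0;

    // Runs the URI parsing and normalization only once per distinct pair.
    boolean notUnderCached(String onepath, String fpath) {
        String key = onepath + '\n' + fpath;
        return memo.computeIfAbsent(key, k -> {
            uriCalls++;
            return notUnder(URI.create(onepath), URI.create(fpath));
        });
    }
}
```

For example, with onepath hdfs://nn/warehouse/t/ds=1 and fpath hdfs://nn/warehouse/t/ds=1/000000_0, notUnder returns false, while with onepath hdfs://nn/warehouse/t/ds=2 it returns true. Whether memoizing is the right trade-off depends on how many distinct partition paths a task sees.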