[ https://issues.apache.org/jira/browse/HIVE-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492544#comment-17492544 ]
Ayush Saxena commented on HIVE-25958:
-------------------------------------

Hi [~rajesh.balamohan]

I was just casually exploring this. Do you mean we should have a ThreadPoolExecutor and submit the files to be processed concurrently? Something like [https://github.com/ayushtkn/hive/commit/7ae3064e0ba0fe3ca5fab94f016dd970c7c603ad#diff-d90cfdde6fd1ec02081c56152ca143bba5a4da5a0792a875043566eee85b9297R231-R244] in [BasicStatsNoJobTask.java|https://github.com/ayushtkn/hive/commit/7ae3064e0ba0fe3ca5fab94f016dd970c7c603ad#diff-d90cfdde6fd1ec02081c56152ca143bba5a4da5a0792a875043566eee85b9297]. I was only exploring this part of the code and got the class name from the trace, so please ignore if I misunderstood the problem. :)

> Optimise BasicStatsNoJobTask
> ----------------------------
>
>            Key: HIVE-25958
>            URL: https://issues.apache.org/jira/browse/HIVE-25958
>        Project: Hive
>     Issue Type: Improvement
>       Reporter: Rajesh Balamohan
>       Priority: Major
>
> When a large number of files is present, analyzing a table (for stats) takes much longer, especially on cloud platforms. Each file is read sequentially to compute its stats, which can be optimized.
>
> {code:java}
> at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:293)
> at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:506)
> - locked <0x0000000642995b10> (a org.apache.hadoop.fs.s3a.S3AInputStream)
> at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:775)
> - locked <0x0000000642995b10> (a org.apache.hadoop.fs.s3a.S3AInputStream)
> at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:116)
> at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:574)
> at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:282)
> at org.apache.orc.impl.RecordReaderImpl.readAllDataStreams(RecordReaderImpl.java:1172)
> at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1128)
> at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1281)
> at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1316)
> at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:302)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:68)
> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:83)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:367)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.<init>(OrcInputFormat.java:276)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:2027)
> at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask$FooterStatCollector.run(BasicStatsNoJobTask.java:235)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> "HiveServer2-Background-Pool: Thread-5161" #5161 prio=5 os_prio=0 tid=0x00007f271217d800 nid=0x21b7 waiting on condition [0x00007f26fce88000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000006bee1b3a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
> at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.shutdownAndAwaitTermination(BasicStatsNoJobTask.java:426)
> at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.aggregateStats(BasicStatsNoJobTask.java:338)
> at org.apache.hadoop.hive.ql.stats.BasicStatsNoJobTask.process(BasicStatsNoJobTask.java:121)
> at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361)
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334)
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:250)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
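The ThreadPoolExecutor idea discussed above can be sketched roughly as below. This is a minimal standalone illustration, not Hive's actual BasicStatsNoJobTask code: the class name, the file list, and the computeStats placeholder are all hypothetical stand-ins for the per-file footer/stat read.

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: submit each file's stat computation to a thread pool instead of
// reading the files one after another. Names here are illustrative only.
public class ConcurrentStatsSketch {

    public static Map<String, Long> collectStats(List<String> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per file; tasks run concurrently.
            Map<String, Future<Long>> futures = new LinkedHashMap<>();
            for (String file : files) {
                futures.put(file, pool.submit(() -> computeStats(file)));
            }
            // Collect results; get() propagates any per-file failure.
            Map<String, Long> stats = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Long>> e : futures.entrySet()) {
                stats.put(e.getKey(), e.getValue().get());
            }
            return stats;
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

    // Placeholder for reading a file footer and deriving a stat value;
    // here it just returns the file name's length.
    static long computeStats(String file) {
        return file.length();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectStats(List.of("a.orc", "bb.orc"), 4));
    }
}
{code}

On S3-like stores the per-file reads are latency-bound rather than CPU-bound, so overlapping them this way is where the speedup would come from; the pool size would presumably need to be configurable.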