rbalamohan commented on a change in pull request #3037:
URL: https://github.com/apache/hive/pull/3037#discussion_r811588175



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -223,6 +227,16 @@ public void run() {
         } else {
           fileList = HiveStatsUtils.getFileStatusRecurse(dir, -1, fs);
         }
+        ThreadPoolExecutor tpE = null;
+        ArrayList<Future<FileStats>> futures = null;

Review comment:
       Declare as `List` instead of `ArrayList`?

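The suggestion above (declaring the field against the `List` interface rather than the concrete `ArrayList`) can be sketched as follows. `FileStats` is stubbed as a hypothetical placeholder for the type in the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

public class ListDeclarationSketch {
    // Hypothetical stand-in for the FileStats type used in the PR.
    static class FileStats {}

    public static void main(String[] args) {
        // Declare against the interface; the concrete collection class
        // stays a construction-site detail and is easy to swap later.
        List<Future<FileStats>> futures = new ArrayList<>();
        futures.add(CompletableFuture.completedFuture(new FileStats()));
        System.out.println(futures.size()); // 1
    }
}
```

Programming to the interface keeps callers decoupled from the backing collection, which is the usual rationale behind this kind of review nit.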
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -223,6 +227,16 @@ public void run() {
         } else {
           fileList = HiveStatsUtils.getFileStatusRecurse(dir, -1, fs);
         }
+        ThreadPoolExecutor tpE = null;
+        ArrayList<Future<FileStats>> futures = null;
+        int numThreads = HiveConf.getIntVar(jc, HiveConf.ConfVars.BASICSTATSTASKSMAXTHREADS);
+        if (fileList.size() > 1 && numThreads > 1) {
+          numThreads = Math.max(fileList.size(), numThreads);

Review comment:
       Limit to 2 * available processors? (i.e. instead of fileList.size(). In case the file listing is 1k, it shouldn't spin up 1k threads.)

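A minimal sketch of the cap the reviewer is asking for, assuming the bound is 2 * available processors. Note that the quoted diff uses `Math.max(fileList.size(), numThreads)`, which makes the pool grow with the file count; the bound described in the comment would use `Math.min`. The helper name `cappedThreads` is hypothetical:

```java
public class ThreadCapSketch {
    // Hypothetical helper: bound the pool size by both the configured maximum
    // and 2 * available processors, never by the raw file count.
    static int cappedThreads(int configuredMax, int fileCount) {
        int hardCap = 2 * Runtime.getRuntime().availableProcessors();
        // No point in more threads than files, and never exceed the hard cap.
        return Math.max(1, Math.min(fileCount, Math.min(configuredMax, hardCap)));
    }

    public static void main(String[] args) {
        // With a 1k file listing, the pool stays bounded by the
        // processor-based cap rather than spinning up 1k threads.
        int threads = cappedThreads(64, 1000);
        System.out.println(threads <= 2 * Runtime.getRuntime().availableProcessors());
    }
}
```

Capping at a small multiple of the core count is a common heuristic for mixed I/O- and CPU-bound work like footer reading.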
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##########
@@ -232,28 +246,33 @@ public void run() {
             if (file.getLen() == 0) {
               numFiles += 1;
             } else {
-              org.apache.hadoop.mapred.RecordReader<?, ?> recordReader = inputFormat.getRecordReader(dummySplit, jc, Reporter.NULL);
-              try {
-                if (recordReader instanceof StatsProvidingRecordReader) {
-                  StatsProvidingRecordReader statsRR;
-                  statsRR = (StatsProvidingRecordReader) recordReader;
-                  rawDataSize += statsRR.getStats().getRawDataSize();
-                  numRows += statsRR.getStats().getRowCount();
-                  fileSize += file.getLen();
-                  numFiles += 1;
-                  if (file.isErasureCoded()) {
-                    numErasureCodedFiles++;
-                  }
-                } else {
-                  throw new HiveException(String.format("Unexpected file found during reading footers for: %s ", file));
-                }
-              } finally {
-                recordReader.close();
+              FileStatProcessor fsp = new FileStatProcessor(file, inputFormat, dummySplit, jc);
+              if (tpE != null) {
+                futures.add(tpE.submit(fsp));

Review comment:
       Add exception handling? (e.g. kill/cancel the other tasks on any exception from one of them)
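The cancel-on-failure behavior the reviewer suggests can be sketched with plain `ExecutorService` futures, independent of the Hive types. This is a standalone illustration, not the PR's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CancelOnFailureSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Long>> futures = new ArrayList<>();

        // One task fails immediately; the other would run for a long time.
        futures.add(pool.submit((Callable<Long>) () -> {
            throw new IllegalStateException("bad footer");
        }));
        futures.add(pool.submit(() -> {
            Thread.sleep(10_000); // interrupted by cancel(true) below
            return 1L;
        }));

        boolean failed = false;
        for (Future<Long> f : futures) {
            try {
                f.get();
            } catch (ExecutionException e) {
                failed = true;
                // First failure observed: cancel everything still pending
                // instead of letting it run to completion.
                for (Future<Long> other : futures) {
                    other.cancel(true);
                }
                break;
            }
        }
        pool.shutdownNow();
        System.out.println(failed);
    }
}
```

Cancelling with `cancel(true)` interrupts in-flight tasks, so the long-running task above does not delay shutdown once the first exception is seen.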




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


