pvary commented on a change in pull request #3053:
URL: https://github.com/apache/hive/pull/3053#discussion_r814583976
##########
File path:
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
##########
@@ -303,56 +304,132 @@ void checkTable(Table table, PartitionIterable parts, byte[] filterExp, CheckRes
if (tablePath == null) {
return;
}
- FileSystem fs = tablePath.getFileSystem(conf);
- if (!fs.exists(tablePath)) {
+ final FileSystem[] fs = {tablePath.getFileSystem(conf)};
+ if (!fs[0].exists(tablePath)) {
result.getTablesNotOnFs().add(table.getTableName());
return;
}
Set<Path> partPaths = new HashSet<>();
- // check that the partition folders exist on disk
- for (Partition partition : parts) {
- if (partition == null) {
- // most likely the user specified an invalid partition
- continue;
- }
- Path partPath = getDataLocation(table, partition);
- if (partPath == null) {
- continue;
- }
- fs = partPath.getFileSystem(conf);
+ int threadCount = MetastoreConf.getIntVar(conf, MetastoreConf.ConfVars.METASTORE_MSCK_FS_HANDLER_THREADS_COUNT);
+
+ final ExecutorService pool = (threadCount > 1) ?
+ Executors.newFixedThreadPool(threadCount,
+ new ThreadFactoryBuilder()
+ .setDaemon(true)
+ .setNameFormat("CheckTable-PartitionOptimizer-%d").build()) : null;
- CheckResult.PartitionResult prFromMetastore = new CheckResult.PartitionResult();
- prFromMetastore.setPartitionName(getPartitionName(table, partition));
- prFromMetastore.setTableName(partition.getTableName());
- if (!fs.exists(partPath)) {
- result.getPartitionsNotOnFs().add(prFromMetastore);
+ try {
+ Queue<Future<String>> futures = new LinkedList<>();
+ if (pool != null) {
+ // check that the partition folders exist on disk using multi-thread
+ for (Partition partition : parts) {
Review comment:
I think this will fetch all of the partitions from the partition
iterator immediately and keep them in memory.
The goal of the partition iterator was to prevent OOM on big
tables with a huge number of partitions. We do not want every partition in
memory at once, so the iterator fetches them in batches, and once a batch is
no longer used we let the GC reclaim it.
With this change I expect that we create a `Future` for every partition
immediately, and that all of the partitions stay in memory until all of
the checks are finished.
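
One way to keep the batching behaviour while still checking in parallel is to submit only a bounded number of tasks at a time and wait for them before pulling more items from the iterator. The sketch below is illustrative only and is not the code from this PR: `BoundedBatchCheck`, `findMissing`, and `batchSize` are hypothetical names, and a plain `Iterator<T>` with a `Predicate<T>` stands in for the partition iterator and the `fs.exists(partPath)` check.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class BoundedBatchCheck {

  /**
   * Runs `exists` on every item from `items` using `pool`, but pulls at most
   * `batchSize` items from the iterator before waiting for their results.
   * Returns the items for which `exists` returned false, in iteration order.
   */
  public static <T> List<T> findMissing(Iterator<T> items, Predicate<T> exists,
      ExecutorService pool, int batchSize)
      throws InterruptedException, ExecutionException {
    List<T> missing = new ArrayList<>();
    List<T> batch = new ArrayList<>(batchSize);
    List<Future<Boolean>> futures = new ArrayList<>(batchSize);

    while (items.hasNext()) {
      // Drop references to the previous batch so it can be garbage collected.
      batch.clear();
      futures.clear();

      // Submit at most batchSize checks before blocking.
      while (items.hasNext() && batch.size() < batchSize) {
        T item = items.next();
        Callable<Boolean> task = () -> exists.test(item);
        batch.add(item);
        futures.add(pool.submit(task));
      }

      // Wait for the whole batch; only failed items are retained afterwards.
      for (int i = 0; i < futures.size(); i++) {
        if (!futures.get(i).get()) {
          missing.add(batch.get(i));
        }
      }
    }
    return missing;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Integer> ids = new ArrayList<>();
      for (int i = 0; i < 10; i++) {
        ids.add(i);
      }
      // Stand-in check: pretend that odd ids are missing on the file system.
      List<Integer> missing = findMissing(ids.iterator(), id -> id % 2 == 0, pool, 3);
      System.out.println(missing); // prints [1, 3, 5, 7, 9]
    } finally {
      pool.shutdown();
    }
  }
}
```

With this shape, at most `batchSize` items and futures are referenced at any time, at the cost of idle threads while the slowest task in a batch finishes. A semaphore or a bounded work queue would avoid that stall, but is more code.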
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]