[ https://issues.apache.org/jira/browse/HIVE-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
t oo updated HIVE-21546:
------------------------
    Affects Version/s: 3.1.1
                       2.3.4
         Component/s: StorageHandler
                      storage-api
                      File Formats


> hiveserver2 - "mapred.FileInputFormat: Total input files to process" - why
> single threaded?
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21546
>                 URL: https://issues.apache.org/jira/browse/HIVE-21546
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, storage-api, StorageHandler
>    Affects Versions: 3.1.1, 2.3.4
>            Reporter: t oo
>            Priority: Major
>
> I have set up Hive (v2.3.4) on Spark as the execution engine (MR hits the same
> issue) with Hadoop 2.7.6 (also tried Hadoop 2.8.5). My external Hive table is in
> Parquet format on S3, spread across hundreds of partitions. The settings below
> are all set to 20:
> {{hive.exec.input.listing.max.threads}}
> {{mapred.dfsclient.parallelism.max}}
> {{mapreduce.input.fileinputformat.list-status.num-threads}}
> Run a simple query:
> {{select * from s.t where h_code = 'KGD78' and h_no = '265'}}
> I can see the below in the HiveServer2 logs (the logs continue for more than
> 1000 lines, listing all the different partitions). Why is the listing of files
> not being done in parallel? It takes more than 5 minutes just for the listing.
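For reference, the three listing-parallelism properties above can be applied per session from Beeline or the Hive CLI before running the query. A minimal sketch, assuming the value 20 under test (these are session-level overrides; they can also be set in hive-site.xml):

```sql
-- Session-level overrides for input-listing parallelism (value 20 as in the report)
SET hive.exec.input.listing.max.threads=20;
SET mapred.dfsclient.parallelism.max=20;
SET mapreduce.input.fileinputformat.list-status.num-threads=20;

-- Issuing SET with a property name and no value echoes its current setting,
-- which is a quick way to confirm the override took effect:
SET mapreduce.input.fileinputformat.list-status.num-threads;
```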
> {{2019-03-29T11:29:26,866 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] compress.CodecPool: Got brand-new decompressor [.snappy]
> 2019-03-29T11:29:27,283 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:27,797 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:28,374 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:28,919 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:29,483 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:30,003 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:30,518 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:31,001 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:31,549 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:32,048 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:32,574 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:33,130 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:33,639 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:34,189 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:34,743 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:35,208 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:35,701 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:36,183 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:36,662 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:37,154 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:37,645 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1}}
> I've tried {{hive.exec.input.listing.max.threads}},
> {{mapred.dfsclient.parallelism.max}}, and
> {{mapreduce.input.fileinputformat.list-status.num-threads}} with the defaults,
> 1, and 50; still the same result.
>
> Hive 3.1.1 / Hadoop 3.1.2 also has the issue:
>
> 2019-03-29T18:10:15,451 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:15,461 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 10 ms. row count = 4584
> 2019-03-29T18:10:15,620 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:15,714 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:15,757 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:15,767 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4584 records.
> 2019-03-29T18:10:15,767 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:15,777 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 10 ms. row count = 4584
> 2019-03-29T18:10:15,984 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:16,033 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:16,070 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:16,080 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4584 records.
> 2019-03-29T18:10:16,080 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:16,089 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 9 ms. row count = 4584
> 2019-03-29T18:10:16,287 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:16,356 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:16,404 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:16,415 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4584 records.
> 2019-03-29T18:10:16,415 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:16,426 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 11 ms. row count = 4584
> 2019-03-29T18:10:16,613 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:16,654 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:16,700 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:16,712 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:16,712 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:16,722 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 10 ms. row count = 240
> 2019-03-29T18:10:16,895 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:16,934 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,004 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,015 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:17,015 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:17,024 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 9 ms. row count = 240
> 2019-03-29T18:10:17,217 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:17,269 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,306 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,315 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:17,315 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:17,325 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 10 ms. row count = 240
> 2019-03-29T18:10:17,478 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:17,513 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,548 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,559 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:17,559 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-03-29T18:10:17,568 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: block read in memory in 9 ms. row count = 240
> 2019-03-29T18:10:17,729 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T18:10:17,805 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,845 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random IO seek policy
> 2019-03-29T18:10:17,854 INFO [16b32706-3490-432d-b49e-67279ea88e15 HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 4584 records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)