[ https://issues.apache.org/jira/browse/HIVE-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

t oo updated HIVE-21546:
------------------------
    Affects Version/s: 3.1.1
                       2.3.4
          Component/s: StorageHandler
                       storage-api
                       File Formats

> hiveserver2 - “mapred.FileInputFormat: Total input files to process” - why 
> single threaded?
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21546
>                 URL: https://issues.apache.org/jira/browse/HIVE-21546
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, storage-api, StorageHandler
>    Affects Versions: 3.1.1, 2.3.4
>            Reporter: t oo
>            Priority: Major
>
> I have set up Hive (v2.3.4) on Spark (as the execution engine; MR hits the same 
> issue) with Hadoop 2.7.6 (or Hadoop 2.8.5). My external Hive table is in Parquet 
> format on S3, spread across hundreds of partitions. The settings below are all set to 20:
> {{hive.exec.input.listing.max.threads}}, {{mapred.dfsclient.parallelism.max}}, 
> {{mapreduce.input.fileinputformat.list-status.num-threads}}
> Run a simple query:
> {{select * from s.t where h_code = 'KGD78' and h_no = '265'}}
> I can see the following in the HiveServer2 logs (they continue for more than 1000 
> lines, listing all the different partitions). Why is the file listing not done in 
> parallel? The listing alone takes more than 5 minutes.
> {{2019-03-29T11:29:26,866 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] compress.CodecPool: Got brand-new decompressor [.snappy]
> 2019-03-29T11:29:27,283 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:27,797 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:28,374 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:28,919 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:29,483 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:30,003 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:30,518 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:31,001 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:31,549 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:32,048 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:32,574 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:33,130 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:33,639 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:34,189 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:34,743 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:35,208 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:35,701 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:36,183 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:36,662 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:37,154 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1
> 2019-03-29T11:29:37,645 INFO [3fa82455-7853-4c4b-8964-847c00bec708 HiveServer2-Handler-Pool: Thread-53] mapred.FileInputFormat: Total input files to process : 1}}
> I've tried setting
> {{hive.exec.input.listing.max.threads}}, {{mapred.dfsclient.parallelism.max}}, and 
> {{mapreduce.input.fileinputformat.list-status.num-threads}}
> to their defaults, to 1, and to 50; the result is the same each time.
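> For anyone reproducing this, a minimal per-session sketch of applying the three 
> settings before running the query (the value 20 and the table/column names are 
> taken from the description above; adjust to your environment):
>
> {code:sql}
> -- raise listing parallelism for this session (illustrative values from the report)
> SET hive.exec.input.listing.max.threads=20;
> SET mapred.dfsclient.parallelism.max=20;
> SET mapreduce.input.fileinputformat.list-status.num-threads=20;
>
> -- the query that still triggers the single-threaded, per-partition listing
> SELECT * FROM s.t WHERE h_code = 'KGD78' AND h_no = '265';
> {code}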
>
> Hive 3.1.1 / Hadoop 3.1.2 also has the issue:
>  
> 2019-03-29T18:10:15,451 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:15,461 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 10 ms. row count = 4584
> 2019-03-29T18:10:15,620 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:15,714 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:15,757 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:15,767 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 4584 records.
> 2019-03-29T18:10:15,767 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:15,777 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 10 ms. row count = 4584
> 2019-03-29T18:10:15,984 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:16,033 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:16,070 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:16,080 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 4584 records.
> 2019-03-29T18:10:16,080 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:16,089 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 9 ms. row count = 4584
> 2019-03-29T18:10:16,287 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:16,356 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:16,404 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:16,415 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 4584 records.
> 2019-03-29T18:10:16,415 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:16,426 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 11 ms. row count = 4584
> 2019-03-29T18:10:16,613 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:16,654 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:16,700 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:16,712 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:16,712 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:16,722 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 10 ms. row count = 240
> 2019-03-29T18:10:16,895 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:16,934 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,004 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,015 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:17,015 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:17,024 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 9 ms. row count = 240
> 2019-03-29T18:10:17,217 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:17,269 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,306 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,315 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:17,315 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:17,325 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 10 ms. row count = 240
> 2019-03-29T18:10:17,478 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:17,513 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,548 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,559 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 240 records.
> 2019-03-29T18:10:17,559 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: at 
> row 0. reading next block
> 2019-03-29T18:10:17,568 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> block read in memory in 9 ms. row count = 240
> 2019-03-29T18:10:17,729 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] mapred.FileInputFormat: Total input 
> files to process : 1
> 2019-03-29T18:10:17,805 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,845 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] s3a.S3AInputStream: Switching to Random 
> IO seek policy
> 2019-03-29T18:10:17,854 INFO [16b32706-3490-432d-b49e-67279ea88e15 
> HiveServer2-Handler-Pool: Thread-30] hadoop.InternalParquetRecordReader: 
> RecordReader initialized will read a total of 4584 records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)