Hi, we are running Hive on Spark. We have an external table over a Snappy-compressed CSV file of size 917.4 MB. The HDFS block size is set to 256 MB.
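
For reference, the block count I am assuming follows from simple arithmetic. A minimal sketch (the function name is only illustrative, and this counts HDFS blocks, which is not necessarily the number of input splits or tasks Spark will create):

```python
import math


def expected_block_count(file_size_mb: float, block_size_mb: float) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    # The last block may be only partially filled, so round up.
    return math.ceil(file_size_mb / block_size_mb)


blocks = expected_block_count(917.4, 256)
print(blocks)  # prints 4 (three full 256 MB blocks plus one partial block)
```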

As I understand it, a query over that external table should launch 4 tasks, one for each block. Instead, I see one executor and one task processing the whole file.

To understand why, I went one step further and looked at block locality. When I get the block locations for that file, I find:

[DatanodeInfoWithStorage[10.11.0.226:50010 ,DS-bf39d33d-48e1-4a8f-be48-b0953fdaad37,DISK], DatanodeInfoWithStorage[10.11.0.227:50010 ,DS-a760c1c8-ce0c-4eb8-8183-8d8ff5f24115,DISK], DatanodeInfoWithStorage[10.11.0.228:50010 ,DS-0e5427e2-b030-43f8-91c9-d8517e68414a,DISK]]
[DatanodeInfoWithStorage[10.11.0.226:50010 ,DS-f50ddf2f-b827-4845-b043-8b91ae4017c0,DISK], DatanodeInfoWithStorage[10.11.0.228:50010 ,DS-e8c9785f-c352-489b-8209-4307f3296211,DISK], DatanodeInfoWithStorage[10.11.0.225:50010 ,DS-6f6a3ffd-334b-45fd-ae0f-cc6eb268b0d2,DISK]]
[DatanodeInfoWithStorage[10.11.0.226:50010 ,DS-f8bea6a8-a433-4601-8070-f6c5da840e09,DISK], DatanodeInfoWithStorage[10.11.0.227:50010 ,DS-8aa3f249-790e-494d-87ee-bcfff2182a96,DISK], DatanodeInfoWithStorage[10.11.0.228:50010 ,DS-d06714f4-2fbb-48d3-b858-a023b5c44e9c,DISK]]
[DatanodeInfoWithStorage[10.11.0.226:50010 ,DS-b3a00781-c6bd-498c-a487-5ce6aaa66f48,DISK], DatanodeInfoWithStorage[10.11.0.228:50010 ,DS-fa5aa339-e266-4e20-a360-e7cdad5dacc3,DISK], DatanodeInfoWithStorage[10.11.0.225:50010 ,DS-9d597d3f-cd4f-4c8f-8a13-7be37ce769c9,DISK]]

In the Spark UI, the Locality Level for that task is RACK_LOCAL. If it is RACK_LOCAL, I would expect the task to run on either 10.11.0.226 or 10.11.0.228, because these two nodes hold all four blocks needed for the computation. However, the executor is running on 10.11.0.225.

My theory does not seem to apply anywhere. Please help me understand how Spark/YARN calculates the number of executors and tasks.

Thanks,
-Yeshwanth