Re: RDD Partitions on HDFS file in Hive on Spark Query

yeshwanth kumar Mon, 21 Nov 2016 15:14:38 -0800

Thanks for your reply.

I can definitely change the underlying compression format,
but I am trying to understand the Locality Level:
why did the executor run on a different node, where the blocks are not present,
when the Locality Level is RACK_LOCAL?

Can you shed some light on this?


Thanks,
Yesh


-Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War

On Mon, Nov 21, 2016 at 4:59 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Use ORC, Parquet or Avro as the format, because they support any compression
> type with parallel processing. Alternatively, split your file into several
> smaller ones. Another alternative would be bzip2 (but slower in general) or
> LZO (usually not included by default in many distributions).
>
> On 21 Nov 2016, at 23:17, yeshwanth kumar <yeshwant...@gmail.com> wrote:
>
> Hi,
>
> We are running Hive on Spark. We have an external table over a
> Snappy-compressed CSV file of size 917.4 MB.
> The HDFS block size is set to 256 MB.
>
> As per my understanding, if I run a query over that external table, it
> should launch 4 tasks, one for each block,
> but I am seeing one executor and one task processing the whole file.
>
> I am trying to understand the reason behind this.
>
> I went one step further to understand the block locality.
> When I get the block locations for that file, I find:
>
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-bf39d33d-48e1-4a8f-be48-b0953fdaad37,DISK],
>  DatanodeInfoWithStorage[10.11.0.227:50010,DS-a760c1c8-ce0c-4eb8-8183-8d8ff5f24115,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-0e5427e2-b030-43f8-91c9-d8517e68414a,DISK]]
>
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-f50ddf2f-b827-4845-b043-8b91ae4017c0,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-e8c9785f-c352-489b-8209-4307f3296211,DISK],
>  DatanodeInfoWithStorage[10.11.0.225:50010,DS-6f6a3ffd-334b-45fd-ae0f-cc6eb268b0d2,DISK]]
>
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-f8bea6a8-a433-4601-8070-f6c5da840e09,DISK],
>  DatanodeInfoWithStorage[10.11.0.227:50010,DS-8aa3f249-790e-494d-87ee-bcfff2182a96,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-d06714f4-2fbb-48d3-b858-a023b5c44e9c,DISK]]
>
> [DatanodeInfoWithStorage[10.11.0.226:50010,DS-b3a00781-c6bd-498c-a487-5ce6aaa66f48,DISK],
>  DatanodeInfoWithStorage[10.11.0.228:50010,DS-fa5aa339-e266-4e20-a360-e7cdad5dacc3,DISK],
>  DatanodeInfoWithStorage[10.11.0.225:50010,DS-9d597d3f-cd4f-4c8f-8a13-7be37ce769c9,DISK]]
>
> and in the Spark UI I see that the Locality Level for that task is RACK_LOCAL.
>
> If it is RACK_LOCAL, then it should run on either node 10.11.0.226 or
> 10.11.0.228, because these 2 nodes have all four blocks needed for the
> computation,
> but the executor is running on 10.11.0.225.
>
> My theory does not seem to apply anywhere.
>
> Please help me understand how Spark/YARN calculates the number of
> executors/tasks.
>
> Thanks,
> -Yeshwanth
>
>
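
A note on the numbers in the quoted question: the short Python sketch below reproduces the two back-of-the-envelope checks it makes, the expected task count and the set of nodes holding every block. It is an illustration only, not how Hive, Spark or YARN compute these values. The function names are made up for this note, the split estimate ignores Hadoop's split slop and any split combining done by Hive, and it assumes (as the reply above suggests) that a Snappy-compressed CSV file cannot be split. The replica list is copied from the block locations quoted above.

```python
import math
from collections import Counter

# Replica hosts per block, copied from the block locations in the question.
BLOCK_REPLICAS = [
    {"10.11.0.226", "10.11.0.227", "10.11.0.228"},
    {"10.11.0.226", "10.11.0.228", "10.11.0.225"},
    {"10.11.0.226", "10.11.0.227", "10.11.0.228"},
    {"10.11.0.226", "10.11.0.228", "10.11.0.225"},
]


def expected_input_splits(file_size_mb, block_size_mb, splittable):
    """Rough split estimate for one file (hypothetical helper).

    Assumption: a non-splittable file is read as a single split,
    otherwise there is one split per block.
    """
    if file_size_mb <= 0:
        return 0
    if not splittable:
        return 1
    return math.ceil(file_size_mb / block_size_mb)


def nodes_holding_all_blocks(replicas):
    """Hosts that store a replica of every block."""
    if not replicas:
        return []
    return sorted(set.intersection(*replicas))


def replicas_per_node(replicas):
    """Number of blocks of the file stored on each host."""
    return dict(Counter(host for hosts in replicas for host in hosts))


if __name__ == "__main__":
    print(expected_input_splits(917.4, 256, splittable=True))
    print(expected_input_splits(917.4, 256, splittable=False))
    print(nodes_holding_all_blocks(BLOCK_REPLICAS))
    print(replicas_per_node(BLOCK_REPLICAS))
```

With the figures from the question, a splittable 917.4 MB file at a 256 MB block size gives 4 splits, while the non-splittable case gives 1, which matches the single task that was observed. The tally also shows that 10.11.0.225 holds replicas of two of the four blocks, while only 10.11.0.226 and 10.11.0.228 hold all four.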
