Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread yeshwanth kumar
Hi Ayan, thanks for the explanation, I am aware of compression codecs. How is the locality level set? Is it done by Spark or by YARN? Please let me know. Thanks, Yesh
On Nov 22, 2016 5:13 PM, "ayan guha" wrote: Hi RACK_LOCAL = Task running on the same rack but not on
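For context on the question above: the locality level is assigned by the Spark task scheduler, not by YARN; YARN only hands out containers, and Spark then matches tasks to them based on where HDFS reports the block replicas. A minimal sketch of the knobs involved, assuming a Spark 2.x application; the values shown are the documented defaults, not settings taken from this thread:

// The scheduler prefers NODE_LOCAL, waits spark.locality.wait.* for a slot to
// free up at that level, then degrades to RACK_LOCAL and finally ANY.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-wait-sketch")
  .config("spark.locality.wait", "3s")       // base wait before degrading one level
  .config("spark.locality.wait.node", "3s")  // wait for a NODE_LOCAL slot
  .config("spark.locality.wait.rack", "3s")  // wait for a RACK_LOCAL slot
  .getOrCreate()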

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
Hi, RACK_LOCAL = task running on the same rack, but not on the same node, as the data. NODE_LOCAL = task and data are co-located. Probably you were looking for this one? GZIP - the read goes through the GZIP codec, but because it is non-splittable you can have at most 1 task reading a gzip file. Now, the
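One way to see which locality level each task actually ran at, beyond the Spark UI, is a listener that prints it per task. A minimal sketch, assuming a Spark 2.x spark-shell where the spark session is predefined:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Log the locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY) of every finished task.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"stage=${taskEnd.stageId} locality=${taskEnd.taskInfo.taskLocality}")
})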

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread yeshwanth kumar
Hi Ayan, we have the default rack topology. -Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War
On Tue, Nov 22, 2016 at 6:37 AM, ayan guha wrote: > Because snappy is not splittable, a single task makes sense. > > Are you sure about the rack topology?

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-22 Thread ayan guha
Because snappy is not splittable, a single task makes sense. Are you sure about the rack topology? I.e., is 225 in a different rack than 227 or 228? What does your topology file say? On 22 Nov 2016 10:14, "yeshwanth kumar" wrote: > Thanks for your reply, > > i can definitely
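One way to cross-check the topology question is to ask HDFS directly which datanodes hold the file's blocks and compare that with the executor hosts shown in the Spark UI. A minimal sketch, assuming a spark-shell session; the path is hypothetical, not taken from this thread:

import org.apache.hadoop.fs.{FileSystem, Path}

// List the replica hosts of every block of the file.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val st = fs.getFileStatus(new Path("hdfs:///data/events.csv.snappy"))
fs.getFileBlockLocations(st, 0, st.getLen).foreach { loc =>
  println(s"offset=${loc.getOffset} len=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
}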

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread yeshwanth kumar
Thanks for your reply. I can definitely change the underlying compression format, but I am trying to understand the locality level: why did the executor run on a different node, where the blocks are not present, when the locality level is RACK_LOCAL? Can you shed some light on this? Thanks, Yesh

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Jörn Franke
Use ORC, Parquet or Avro as the format, because they support any compression type with parallel processing. Alternatively, split your file into several smaller ones. Another alternative would be bzip2 (but generally slower) or LZO (usually not included by default in many distributions). > On
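Following that suggestion, a hedged sketch of rewriting the existing table as Parquet so that reads split into multiple tasks; the table names are hypothetical, and this assumes a SparkSession with Hive support enabled:

// Snappy inside Parquet is applied per page/row group, so the resulting files
// remain splittable even though the compression codec itself is snappy.
spark.sql(
  """CREATE TABLE events_parquet
    |STORED AS PARQUET
    |AS SELECT * FROM events_csv_snappy""".stripMargin)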

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Aniket Bhatnagar
Try changing the compression to bzip2 or LZO. For reference - http://comphadoop.weebly.com Thanks, Aniket On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar wrote: > Hi, > > we are running Hive on Spark, we have an external table over a snappy > compressed csv file of size 917.4 M
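A hedged sketch of the recompression idea: read the single snappy file and write it back as bzip2-compressed text, which is a splittable codec, so later scans can run in parallel. Paths are hypothetical, and this assumes a spark-shell session on a cluster where the snappy codec is configured:

import org.apache.hadoop.io.compress.BZip2Codec

// The read side is still one task (non-splittable input), but the bzip2 output
// can be split, so subsequent jobs over it get many tasks.
val lines = spark.sparkContext.textFile("hdfs:///data/events.csv.snappy")
lines.saveAsTextFile("hdfs:///data/events_bzip2", classOf[BZip2Codec])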

RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread yeshwanth kumar
Hi, we are running Hive on Spark and we have an external table over a snappy-compressed CSV file of size 917.4 MB. The HDFS block size is set to 256 MB. As per my understanding, if I run a query over that external table it should launch 4 tasks, one for each block, but I am seeing one executor and one
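A minimal sketch of how to verify the partition count Spark actually derives for that file, assuming a spark-shell session and a hypothetical path. For a splittable file, ceil(917.4 MB / 256 MB) = 4 input splits would be expected; for a single non-splittable snappy CSV it collapses to 1:

// Inspect the input partitioning that the snappy CSV actually gets.
val rdd = spark.sparkContext.textFile("hdfs:///data/events.csv.snappy")
println(s"input partitions = ${rdd.getNumPartitions}")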