Re: data localisation in spark

Sandy Ryza Tue, 02 Jun 2015 10:37:18 -0700

It is not possible with JavaSparkContext either.  The API mentioned below
currently does not have any effect (we should document this).


The primary difference between MR and Spark here is that MR runs each task
in its own YARN container, while Spark runs multiple tasks within an
executor, which needs to be requested before Spark knows what tasks it will
run, although dynamic allocation improves that last part.
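
To make that difference concrete, here is a toy model in plain Java. It is
only a sketch under stated assumptions: it uses no Spark or YARN classes, the
class, method, and host names are all hypothetical, and it assumes a per-task
container request can always be placed on a host that stores the split.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: no Spark or YARN APIs are used, and every name is hypothetical.
final class LocalityToy {

    // MR-style: one container is requested per task, after the splits are known,
    // so each request can name a host that stores that task's split.
    // Assumption: the cluster can always honor the preferred host.
    static int perTaskContainers(List<Set<String>> splitHosts) {
        int nodeLocal = 0;
        for (Set<String> hosts : splitHosts) {
            if (!hosts.isEmpty()) {
                nodeLocal++; // container placed on one of the split's hosts
            }
        }
        return nodeLocal;
    }

    // Spark-style with a fixed set of executors: the executors already sit on
    // given hosts before the tasks are known. A task is node-local only if one
    // of those hosts happens to store its split.
    static int fixedExecutors(List<Set<String>> splitHosts, Set<String> executorHosts) {
        int nodeLocal = 0;
        for (Set<String> hosts : splitHosts) {
            for (String host : hosts) {
                if (executorHosts.contains(host)) {
                    nodeLocal++;
                    break; // count each task at most once
                }
            }
        }
        return nodeLocal;
    }

    public static void main(String[] args) {
        // Each set lists the hosts holding replicas of one input split.
        List<Set<String>> splits = Arrays.asList(
            new HashSet<String>(Arrays.asList("host1", "host2")),
            new HashSet<String>(Arrays.asList("host3", "host4")),
            new HashSet<String>(Arrays.asList("host5", "host6")));
        // Executors were requested before the splits were known.
        Set<String> executors = new HashSet<String>(Arrays.asList("host1", "host7"));

        System.out.println("per-task containers: " + perTaskContainers(splits));
        System.out.println("fixed executors:     " + fixedExecutors(splits, executors));
    }
}
```

With these made-up hosts, all three tasks are node-local in the per-task
model, but only one is node-local with the fixed executors, because only
host1 both runs an executor and stores a split.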

-Sandy

On Tue, Jun 2, 2015 at 9:55 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Is it possible with JavaSparkContext?
>
> JavaSparkContext jsc = new JavaSparkContext(conf);
> JavaRDD<String> lines = jsc.textFile(args[0]);
>
> If yes, is it the programmer's responsibility to first calculate the split
> locations and then instantiate the Spark context with preferred locations?
>
> How is this achieved in MR2 with YARN? Does the Application Master specify
> split locations to the ResourceManager before acquiring the node managers?
>
>
>
> On Mon, Jun 1, 2015 at 7:24 AM, bit1...@163.com <bit1...@163.com> wrote:
>
>> Take a look at the following SparkContext constructor variant that tries
>> to honor the data locality in YARN mode.
>>
>>   /**
>>    * :: DeveloperApi ::
>>    * Alternative constructor for setting preferred locations where Spark will
>>    * create executors.
>>    *
>>    * @param preferredNodeLocationData used in YARN mode to select nodes to launch
>>    * containers on. Can be generated using
>>    * [[org.apache.spark.scheduler.InputFormatInfo.computePreferredLocations]]
>>    * from a list of input files or InputFormats for the application.
>>    */
>>   @DeveloperApi
>>   def this(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]]) = {
>>     this(config)
>>     this.preferredNodeLocationData = preferredNodeLocationData
>>   }
>>
>> ------------------------------
>> bit1...@163.com
>>
>>
>> *From:* Shushant Arora <shushantaror...@gmail.com>
>> *Date:* 2015-05-31 22:54
>> *To:* user <user@spark.apache.org>
>> *Subject:* data localisation in spark
>>
>> I want to understand how Spark takes care of data localisation in cluster
>> mode when run on YARN.
>>
>> 1. The driver program asks the ResourceManager for executors. Does it tell
>> YARN's RM to check the HDFS blocks of the input data and then allocate
>> executors accordingly? And do the executors remain fixed throughout the
>> application, or does the driver program ask for new executors when it
>> submits another job in the same application, since in Spark a new job is
>> created for each action? If the executors are fixed, is achieving data
>> localisation impossible for the second job?
>>
>>
>>
>> 2. When executors are done with their processing, are they marked as free
>> in the ResourceManager's resource queue, and do the executors tell this to
>> the RM directly instead of via the driver?
>>
>> Thanks
>> Shushant
>>
>>
>
