Re: how can i use spark with yarn cluster in java

2023-09-06 Thread Mich Talebzadeh
Sounds like a network issue, for example a problem connecting to the remote server.

try
ping 172.21.242.26
telnet 172.21.242.26 596590

or nc -vz 172.21.242.26 596590

example
nc -vz rhes76 1521
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 50.140.197.230:1521.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 6 Sept 2023 at 18:48, BCMS  wrote:

> I want to use a YARN cluster with my current code. If I use
> conf.set("spark.master","local[*]") in place of
> conf.set("spark.master","yarn"), everything works fine. But when I try to
> use yarn in setMaster, my code gives the error below.
>
>
> ```
> package com.example.pocsparkspring;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.security.UserGroupInformation;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.*;
>
> public class poc2 {
> public static void main(String[] args) {
> try {
> System.setProperty("hadoop.home.dir", "D:/hadoop");
> String HADOOP_CONF_DIR =
> "D:/Dev/ws/poc-spark-spring/src/main/resources/";
> Configuration configuration1 = new Configuration();
> configuration1.addResource(new Path(HADOOP_CONF_DIR +
> "core-site.xml"));
> configuration1.addResource(new Path(HADOOP_CONF_DIR +
> "hdfs-site.xml"));
> configuration1.addResource(new Path(HADOOP_CONF_DIR +
> "yarn-site.xml"));
> configuration1.addResource(new Path(HADOOP_CONF_DIR +
> "mapred-site.xml"));
> UserGroupInformation.setConfiguration(configuration1);
> UserGroupInformation.loginUserFromKeytab("user",
> HADOOP_CONF_DIR + "mykeytab.keytab");
>
>
> SparkConf conf = new SparkConf();
> conf.set("spark.master","yarn");
> conf.set("spark.kerberos.keytab", HADOOP_CONF_DIR +
> "mykeytab.keytab");
> conf.set("spark.kerberos.principal", "user");
> conf.set("hadoop.security.authentication", "kerberos");
> conf.set("hadoop.security.authorization", "true");
> conf.set("dfs.client.use.datanode.hostname", "true");
> conf.set("spark.security.credentials.hbase.enabled", "false");
> conf.set("spark.security.credentials.hive.enabled", "false");
> conf.setAppName("poc");
> conf.setJars(new String[]{"target/poc-spark-spring.jar"});
>
>
> JavaSparkContext sparkContext = new JavaSparkContext(conf);
> SparkSession sparkSession = SparkSession.builder()
> .sparkContext(sparkContext.sc())
> .getOrCreate();
>
> sparkSession.sparkContext().hadoopConfiguration().addResource(new
> Path(HADOOP_CONF_DIR + "core-site.xml"));
>
> sparkSession.sparkContext().hadoopConfiguration().addResource(new
> Path(HADOOP_CONF_DIR + "hdfs-site.xml"));
>
> sparkSession.sparkContext().hadoopConfiguration().addResource(new
> Path(HADOOP_CONF_DIR + "yarn-site.xml"));
>
> sparkSession.sparkContext().hadoopConfiguration().addResource(new
> Path(HADOOP_CONF_DIR + "mapred-site.xml"));
>
> sparkSession.sparkContext().conf().set("spark.kerberos.keytab",
> HADOOP_CONF_DIR + "mykeytab.keytab");
>
> sparkSession.sparkContext().conf().set("spark.kerberos.principal", "user");
>
> sparkSession.sparkContext().conf().set("hadoop.security.authentication",
> "kerberos");
>
> sparkSession.sparkContext().conf().set("hadoop.security.authorization",
> "true");
>
> sparkSession.sparkContext().conf().set("spark.security.credentials.hbase.enabled",
> "false");
>
> sparkSession.sparkContext().conf().set("spark.security.credentials.hive.enabled",
> "false");
> Dataset oracleDF = sparkSession.read()
> .format("jdbc")
> .option("fetchsize", 1)
> .option("url", "jdbc:oracle:thin:@exampledev
> :1521:exampledev")
> .option("query", "select * FROM exampletable")
> .option("driver", "oracle.jdbc.driver.OracleDriver")
> .option("user", "user")
> .option("password", "pass")
> .load();
> oracleDF.show(false);
>
>
> oracleDF.write().mode(SaveMode.Append).format("csv").save("hdfs:///hive/stage_space/example");
>
> } catch (Exception e) {
> 

how can i use spark with yarn cluster in java

2023-09-06 Thread BCMS
I want to use a YARN cluster with my current code. If I use
conf.set("spark.master","local[*]") in place of
conf.set("spark.master","yarn"), everything works fine. But when I try to
use yarn in setMaster, my code gives the error below.


```
package com.example.pocsparkspring;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;

public class poc2 {
    public static void main(String[] args) {
        try {
            // Point at the local Hadoop client configs and log in from the keytab
            System.setProperty("hadoop.home.dir", "D:/hadoop");
            String HADOOP_CONF_DIR = "D:/Dev/ws/poc-spark-spring/src/main/resources/";
            Configuration configuration1 = new Configuration();
            configuration1.addResource(new Path(HADOOP_CONF_DIR + "core-site.xml"));
            configuration1.addResource(new Path(HADOOP_CONF_DIR + "hdfs-site.xml"));
            configuration1.addResource(new Path(HADOOP_CONF_DIR + "yarn-site.xml"));
            configuration1.addResource(new Path(HADOOP_CONF_DIR + "mapred-site.xml"));
            UserGroupInformation.setConfiguration(configuration1);
            UserGroupInformation.loginUserFromKeytab("user", HADOOP_CONF_DIR + "mykeytab.keytab");

            // Spark configuration: the same settings work with "local[*]" but fail with "yarn"
            SparkConf conf = new SparkConf();
            conf.set("spark.master", "yarn");
            conf.set("spark.kerberos.keytab", HADOOP_CONF_DIR + "mykeytab.keytab");
            conf.set("spark.kerberos.principal", "user");
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hadoop.security.authorization", "true");
            conf.set("dfs.client.use.datanode.hostname", "true");
            conf.set("spark.security.credentials.hbase.enabled", "false");
            conf.set("spark.security.credentials.hive.enabled", "false");
            conf.setAppName("poc");
            conf.setJars(new String[]{"target/poc-spark-spring.jar"});

            JavaSparkContext sparkContext = new JavaSparkContext(conf);
            SparkSession sparkSession = SparkSession.builder()
                    .sparkContext(sparkContext.sc())
                    .getOrCreate();

            sparkSession.sparkContext().hadoopConfiguration().addResource(new Path(HADOOP_CONF_DIR + "core-site.xml"));
            sparkSession.sparkContext().hadoopConfiguration().addResource(new Path(HADOOP_CONF_DIR + "hdfs-site.xml"));
            sparkSession.sparkContext().hadoopConfiguration().addResource(new Path(HADOOP_CONF_DIR + "yarn-site.xml"));
            sparkSession.sparkContext().hadoopConfiguration().addResource(new Path(HADOOP_CONF_DIR + "mapred-site.xml"));
            sparkSession.sparkContext().conf().set("spark.kerberos.keytab", HADOOP_CONF_DIR + "mykeytab.keytab");
            sparkSession.sparkContext().conf().set("spark.kerberos.principal", "user");
            sparkSession.sparkContext().conf().set("hadoop.security.authentication", "kerberos");
            sparkSession.sparkContext().conf().set("hadoop.security.authorization", "true");
            sparkSession.sparkContext().conf().set("spark.security.credentials.hbase.enabled", "false");
            sparkSession.sparkContext().conf().set("spark.security.credentials.hive.enabled", "false");

            // Read from Oracle over JDBC, show, and append the result to HDFS as CSV
            Dataset<Row> oracleDF = sparkSession.read()
                    .format("jdbc")
                    .option("fetchsize", 1)
                    .option("url", "jdbc:oracle:thin:@exampledev:1521:exampledev")
                    .option("query", "select * FROM exampletable")
                    .option("driver", "oracle.jdbc.driver.OracleDriver")
                    .option("user", "user")
                    .option("password", "pass")
                    .load();
            oracleDF.show(false);

            oracleDF.write().mode(SaveMode.Append).format("csv").save("hdfs:///hive/stage_space/example");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

```

As I mentioned, my code gives an error. At first everything looks good: in
the cluster GUI I can see my job, and its status changes to "accepted". But
after a while it fails with the error below, and I can see the same error in
the GUI log.


```
09:38:45.129 [YARN application state monitor] ERROR org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
09:38:45.129 [YARN application state monitor] ERROR org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:558)
at
```

[Spark RPC]: Yarn - Application Master / executors to Driver communication issue

2023-07-14 Thread Sunayan Saikia
Hey Spark Community,

Our Jupyterhub/Jupyterlab (with spark client) runs behind two layers of
HAProxy and the Yarn cluster runs remotely. We want to use deploy mode
'client' so that we can capture the output of any spark sql query in
jupyterlab. I'm aware of other technologies like Livy and Spark Connect,
however, we want to do things without using any of these at the moment.

With Spark 3, we have seen that during Spark session creation itself, the
*ApplicationMaster*'s attempts to talk back to the Driver fail with an
exception of *'awaitResults - Too Large Frame: 5211803372140375592 -
Connection closed'.* This doesn't look like the real exception, because it
fails just while creating the Spark session, without any query load.

We are using the configs *spark.driver.host*, *spark.driver.port*,
*spark.driver.blockManager.port* and *spark.driver.bindAddress*. We have
verified that, from outside, the host and the ports used with the above
configs are reachable.
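
For reference, this is roughly how those driver-side settings are wired up on
our side (the hostname and port numbers below are placeholders, not our
actual values):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder host/ports: the proxy-facing hostname and the fixed ports that
// are opened on the HAProxy layers so the AM/executors can reach the driver.
val spark = SparkSession.builder()
  .appName("jupyterlab-session")
  .master("yarn")
  .config("spark.submit.deployMode", "client")
  .config("spark.driver.bindAddress", "0.0.0.0")
  .config("spark.driver.host", "jupyter-proxy.example.com")
  .config("spark.driver.port", "40000")
  .config("spark.driver.blockManager.port", "40001")
  .getOrCreate()
```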

I'm wondering whether Spark supports this type of communication at all. Any
suggestions on how to debug this further?

-- 
Thanks,
Sunayan Saikia


Re: Spark-on-Yarn ClassNotFound Exception

2022-12-18 Thread Hariharan
(DataSource.scala:750)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>>>> at
>>>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>>>> at
>>>> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>>>> at scala.Option.getOrElse(Option.scala:189)
>>>> at
>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>>>>
>>>> Thanks again!
>>>>
>>>> On Tue, Dec 13, 2022 at 9:52 PM scrypso  wrote:
>>>>
>>>>> Two ideas you could try:
>>>>>
>>>>> You can try spark.driver.extraClassPath as well. Spark loads the
>>>>> user's jar in a child classloader, so Spark/Yarn/Hadoop can only see your
>>>>> classes reflectively. Hadoop's Configuration should use the thread ctx
>>>>> classloader, and Spark should set that to the loader that loads your jar.
>>>>> The extraClassPath option just adds jars directly to the Java command that
>>>>> creates the driver/executor.
>>>>>
>>>>> I can't immediately tell how your error might arise, unless there is
>>>>> some timing issue with the Spark and Hadoop setup. Can you share the full
>>>>> stacktrace of the ClassNotFound exception? That might tell us when Hadoop
>>>>> is looking up this class.
>>>>>
>>>>> Good luck!
>>>>> - scrypso
>>>>>
>>>>>
>>>>> On Tue, Dec 13, 2022, 17:05 Hariharan  wrote:
>>>>>
>>>>>> Missed to mention it above, but just to add, the error is coming from
>>>>>> the driver. I tried using *--driver-class-path /path/to/my/jar* as
>>>>>> well, but no luck.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello folks,
>>>>>>>
>>>>>>> I have a spark app with a custom implementation of
>>>>>>> *fs.s3a.s3.client.factory.impl* which is packaged into the same jar.
>>>>>>> Output of *jar tf*
>>>>>>>
>>>>>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>>>>>
>>>>>>> However when I run the my spark app with spark-submit in cluster
>>>>>>> mode, it fails with the following error:
>>>>>>>
>>>>>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>>>>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>>>>>> found*
>>>>>>>
>>>>>>> I tried:
>>>>>>> 1. passing in the jar to the *--jars* option (with the local path)
>>>>>>> 2. Passing in the jar to *spark.yarn.jars* option with an HDFS path
>>>>>>>
>>>>>>> but still the same error.
>>>>>>>
>>>>>>> Any suggestions on what I'm missing?
>>>>>>>
>>>>>>> Other pertinent details:
>>>>>>> Spark version: 3.3.0
>>>>>>> Hadoop version: 3.3.4
>>>>>>>
>>>>>>> Command used to run the app
>>>>>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>>>>>> --master yarn  --conf spark.executor.instances=6   /path/to/my/jar*
>>>>>>>
>>>>>>> TIA!
>>>>>>>
>>>>>>


Re: Spark-on-Yarn ClassNotFound Exception

2022-12-15 Thread scrypso
Hmm, did you mean spark.*driver*.extraClassPath? That is very odd then - if
you check the logs directory for the driver (on the cluster) I think there
should be a launch container log, where you can see the exact command used
to start the JVM (at the very end), and a line starting "export CLASSPATH"
- you can double check that your jar looks to be included correctly there.
If it is, I think you have a really "interesting" issue on your hands!

- scrypso

On Wed, Dec 14, 2022, 05:17 Hariharan  wrote:

> Hi scrypso,
>
> Thanks for the help so far, and I think you're definitely on to something
> here. I tried loading the class as you suggested with the code below:
>
> try {
> 
> Thread.currentThread().getContextClassLoader().loadClass(MyS3ClientFactory.class.getCanonicalName());
> logger.info("Loaded custom class");
> } catch (ClassNotFoundException e) {
> logger.error("Unable to load class", e);
> }
> return spark.read().option("mode", 
> "DROPMALFORMED").format("avro").load();
>
> I am able to load the custom class as above
> *2022-12-14 04:12:34,158 INFO  [Driver] utils.S3Reader - Loaded custom
> class*
>
> But the spark.read code below it tries to initialize the s3 client and is
> not able to load the same class.
>
> I tried adding
> *--conf spark.executor.extraClassPath=myjar*
>
> as well, but no luck :-(
>
> Thanks again!
>
> On Tue, Dec 13, 2022 at 10:09 PM scrypso  wrote:
>
>> I'm on my phone, so can't compare with the Spark source, but that looks
>> to me like it should be well after the ctx loader has been set. You could
>> try printing the classpath of the loader
>> Thread.currentThread().getThreadContextClassLoader(), or try to load your
>> class from that yourself to see if you get the same error.
>>
>> Can you see which thread is throwing the exception? If it is a different
>> thread than the "main" application thread it might not have the thread ctx
>> loader set correctly. I can't see any of your classes in the stacktrace - I
>> assume that is because of your scrubbing, but it could also be because this
>> is run in separate thread without ctx loader set.
>>
>> It also looks like Hadoop is caching the FileSystems somehow - perhaps
>> you can create the S3A filesystem yourself and hope it picks that up? (Wild
>> guess, no idea if that works or how hard it would be.)
>>
>>
>> On Tue, Dec 13, 2022, 17:29 Hariharan  wrote:
>>
>>> Thanks for the response, scrypso! I will try adding the extraClassPath
>>> option. Meanwhile, please find the full stack trace below (I have
>>> masked/removed references to proprietary code)
>>>
>>> java.lang.RuntimeException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
>>> at
>>> org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
>>> at
>>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
>>> at
>>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>>> at
>>> org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
>>> at scala.collection.immutable.List.map(List.scala:293)
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>>> at
>>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>>> at
>>> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>>> at scala.Option.getOrElse(Option.scala:189)
>>> at
>>> org.apache.spark.sql.DataFrameReader

Re: Spark-on-Yarn ClassNotFound Exception

2022-12-13 Thread Hariharan
Hi scrypso,

Thanks for the help so far, and I think you're definitely on to something
here. I tried loading the class as you suggested with the code below:

try {
    Thread.currentThread().getContextClassLoader()
        .loadClass(MyS3ClientFactory.class.getCanonicalName());
    logger.info("Loaded custom class");
} catch (ClassNotFoundException e) {
    logger.error("Unable to load class", e);
}
return spark.read().option("mode", "DROPMALFORMED").format("avro").load();

I am able to load the custom class as above
*2022-12-14 04:12:34,158 INFO  [Driver] utils.S3Reader - Loaded custom
class*

But the spark.read code below it tries to initialize the s3 client and is
not able to load the same class.

I tried adding
*--conf spark.executor.extraClassPath=myjar*

as well, but no luck :-(

Thanks again!

On Tue, Dec 13, 2022 at 10:09 PM scrypso  wrote:

> I'm on my phone, so can't compare with the Spark source, but that looks to
> me like it should be well after the ctx loader has been set. You could try
> printing the classpath of the loader
> Thread.currentThread().getThreadContextClassLoader(), or try to load your
> class from that yourself to see if you get the same error.
>
> Can you see which thread is throwing the exception? If it is a different
> thread than the "main" application thread it might not have the thread ctx
> loader set correctly. I can't see any of your classes in the stacktrace - I
> assume that is because of your scrubbing, but it could also be because this
> is run in separate thread without ctx loader set.
>
> It also looks like Hadoop is caching the FileSystems somehow - perhaps you
> can create the S3A filesystem yourself and hope it picks that up? (Wild
> guess, no idea if that works or how hard it would be.)
>
>
> On Tue, Dec 13, 2022, 17:29 Hariharan  wrote:
>
>> Thanks for the response, scrypso! I will try adding the extraClassPath
>> option. Meanwhile, please find the full stack trace below (I have
>> masked/removed references to proprietary code)
>>
>> java.lang.RuntimeException: java.lang.RuntimeException:
>> java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
>> at
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
>> at
>> org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
>> at
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
>> at
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>> at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>> at
>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>> at
>> org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
>> at scala.collection.immutable.List.map(List.scala:293)
>> at
>> org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
>> at
>> org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
>> at
>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>> at
>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>> at
>> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>> at scala.Option.getOrElse(Option.scala:189)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>>
>> Thanks again!
>>
>> On Tue, Dec 13, 2022 at 9:52 PM scrypso  wrote:
>>
>>> Two ideas you could try:
>>>
>>> You can try spark.driver.extraClassPath as well. Spark loads the user's
>>> jar in a child classloader, so Spark/Yarn/Hadoop can only see your classes
>>> reflectively. Hadoop's Configuration should use the thread ctx classloader,
>>> and Spark should set that to the loader that loads your jar. The
>>> extraClassPath option just adds jars directly to the Java command that
>>> creates the driver/executor.
>>>
>>> I can't immediately tell how your error might arise, unless there is
>>> some timing issue with the Spark and Hadoop setup. Can you share the full
>>> stacktrace of the ClassNotFound exception? That might tell us whe

Re: Spark-on-Yarn ClassNotFound Exception

2022-12-13 Thread scrypso
I'm on my phone, so can't compare with the Spark source, but that looks to
me like it should be well after the ctx loader has been set. You could try
printing the classpath of the loader
Thread.currentThread().getContextClassLoader(), or try to load your
class from that yourself to see if you get the same error.

Can you see which thread is throwing the exception? If it is a different
thread than the "main" application thread, it might not have the thread ctx
loader set correctly. I can't see any of your classes in the stacktrace - I
assume that is because of your scrubbing, but it could also be because this
is run in a separate thread without the ctx loader set.

It also looks like Hadoop is caching the FileSystems somehow - perhaps you
can create the S3A filesystem yourself and hope it picks that up? (Wild
guess, no idea if that works or how hard it would be.)
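
For example, something along these lines (a rough sketch typed from memory —
the URL listing only works if the loader happens to be a URLClassLoader, and
the class name is the factory from the original mail):

```scala
import java.net.URLClassLoader

// See which loader the current thread uses and, if it exposes its URLs,
// whether your jar is actually on it.
val ctxLoader = Thread.currentThread().getContextClassLoader
println(s"context classloader: $ctxLoader")
ctxLoader match {
  case u: URLClassLoader => u.getURLs.foreach(url => println(s"  $url"))
  case _                 => println("  (loader does not expose its URLs)")
}
// Then try resolving the class from that same loader; this throws
// ClassNotFoundException if it is not visible there.
ctxLoader.loadClass("aws.utils.MyS3ClientFactory")
```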


On Tue, Dec 13, 2022, 17:29 Hariharan  wrote:

> Thanks for the response, scrypso! I will try adding the extraClassPath
> option. Meanwhile, please find the full stack trace below (I have
> masked/removed references to proprietary code)
>
> java.lang.RuntimeException: java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
> at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
> at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
> at scala.collection.immutable.List.map(List.scala:293)
> at
> org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
> at
> org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
> at
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
> at scala.Option.getOrElse(Option.scala:189)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>
> Thanks again!
>
> On Tue, Dec 13, 2022 at 9:52 PM scrypso  wrote:
>
>> Two ideas you could try:
>>
>> You can try spark.driver.extraClassPath as well. Spark loads the user's
>> jar in a child classloader, so Spark/Yarn/Hadoop can only see your classes
>> reflectively. Hadoop's Configuration should use the thread ctx classloader,
>> and Spark should set that to the loader that loads your jar. The
>> extraClassPath option just adds jars directly to the Java command that
>> creates the driver/executor.
>>
>> I can't immediately tell how your error might arise, unless there is some
>> timing issue with the Spark and Hadoop setup. Can you share the full
>> stacktrace of the ClassNotFound exception? That might tell us when Hadoop
>> is looking up this class.
>>
>> Good luck!
>> - scrypso
>>
>>
>> On Tue, Dec 13, 2022, 17:05 Hariharan  wrote:
>>
>>> Missed to mention it above, but just to add, the error is coming from
>>> the driver. I tried using *--driver-class-path /path/to/my/jar* as
>>> well, but no luck.
>>>
>>> Thanks!
>>>
>>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan 
>>> wrote:
>>>
>>>> Hello folks,
>>>>
>>>> I have a spark app with a custom implementation of
>>>> *fs.s3a.s3.client.factory.impl* which is packaged into the same jar.
>>>> Output of *jar tf*
>>>>
>>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>>
>>>> However when I run the my spark app with spark-submit in cluster mode,
>>>> it fails with the following error:
>>>>
>>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>>> found*
>

Re: Spark-on-Yarn ClassNotFound Exception

2022-12-13 Thread Hariharan
Thanks for the response, scrypso! I will try adding the extraClassPath
option. Meanwhile, please find the full stack trace below (I have
masked/removed references to proprietary code)

java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
at scala.collection.immutable.List.map(List.scala:293)
at
org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
at
org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
at
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)

Thanks again!

On Tue, Dec 13, 2022 at 9:52 PM scrypso  wrote:

> Two ideas you could try:
>
> You can try spark.driver.extraClassPath as well. Spark loads the user's
> jar in a child classloader, so Spark/Yarn/Hadoop can only see your classes
> reflectively. Hadoop's Configuration should use the thread ctx classloader,
> and Spark should set that to the loader that loads your jar. The
> extraClassPath option just adds jars directly to the Java command that
> creates the driver/executor.
>
> I can't immediately tell how your error might arise, unless there is some
> timing issue with the Spark and Hadoop setup. Can you share the full
> stacktrace of the ClassNotFound exception? That might tell us when Hadoop
> is looking up this class.
>
> Good luck!
> - scrypso
>
>
> On Tue, Dec 13, 2022, 17:05 Hariharan  wrote:
>
>> Missed to mention it above, but just to add, the error is coming from the
>> driver. I tried using *--driver-class-path /path/to/my/jar* as well, but
>> no luck.
>>
>> Thanks!
>>
>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan  wrote:
>>
>>> Hello folks,
>>>
>>> I have a spark app with a custom implementation of
>>> *fs.s3a.s3.client.factory.impl* which is packaged into the same jar.
>>> Output of *jar tf*
>>>
>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>
>>> However when I run the my spark app with spark-submit in cluster mode,
>>> it fails with the following error:
>>>
>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>> found*
>>>
>>> I tried:
>>> 1. passing in the jar to the *--jars* option (with the local path)
>>> 2. Passing in the jar to *spark.yarn.jars* option with an HDFS path
>>>
>>> but still the same error.
>>>
>>> Any suggestions on what I'm missing?
>>>
>>> Other pertinent details:
>>> Spark version: 3.3.0
>>> Hadoop version: 3.3.4
>>>
>>> Command used to run the app
>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>> --master yarn  --conf spark.executor.instances=6   /path/to/my/jar*
>>>
>>> TIA!
>>>
>>


Re: Spark-on-Yarn ClassNotFound Exception

2022-12-13 Thread scrypso
Two ideas you could try:

You can try spark.driver.extraClassPath as well. Spark loads the user's jar
in a child classloader, so Spark/Yarn/Hadoop can only see your classes
reflectively. Hadoop's Configuration should use the thread ctx classloader,
and Spark should set that to the loader that loads your jar. The
extraClassPath option just adds jars directly to the Java command that
creates the driver/executor.
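
A quick way to exercise exactly that lookup path from the driver (a sketch,
assuming `spark` is your SparkSession and using the factory class name from
the original mail) is to ask Spark's Hadoop Configuration to resolve the
class itself:

```scala
// Sketch: resolve the factory class through the same Hadoop Configuration
// (and hence classloader) that S3AFileSystem will use. Throws
// ClassNotFoundException if the class is not visible to that loader.
val hadoopConf = spark.sparkContext.hadoopConfiguration
val cls = hadoopConf.getClassByName("aws.utils.MyS3ClientFactory")
println(s"Resolved ${cls.getName} via ${cls.getClassLoader}")
```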

I can't immediately tell how your error might arise, unless there is some
timing issue with the Spark and Hadoop setup. Can you share the full
stacktrace of the ClassNotFound exception? That might tell us when Hadoop
is looking up this class.

Good luck!
- scrypso


On Tue, Dec 13, 2022, 17:05 Hariharan  wrote:

> Missed to mention it above, but just to add, the error is coming from the
> driver. I tried using *--driver-class-path /path/to/my/jar* as well, but
> no luck.
>
> Thanks!
>
> On Mon, Dec 12, 2022 at 4:21 PM Hariharan  wrote:
>
>> Hello folks,
>>
>> I have a spark app with a custom implementation of
>> *fs.s3a.s3.client.factory.impl* which is packaged into the same jar.
>> Output of *jar tf*
>>
>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>
>> However when I run the my spark app with spark-submit in cluster mode, it
>> fails with the following error:
>>
>> *java.lang.RuntimeException: java.lang.RuntimeException:
>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>> found*
>>
>> I tried:
>> 1. passing in the jar to the *--jars* option (with the local path)
>> 2. Passing in the jar to *spark.yarn.jars* option with an HDFS path
>>
>> but still the same error.
>>
>> Any suggestions on what I'm missing?
>>
>> Other pertinent details:
>> Spark version: 3.3.0
>> Hadoop version: 3.3.4
>>
>> Command used to run the app
>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>> --master yarn  --conf spark.executor.instances=6   /path/to/my/jar*
>>
>> TIA!
>>
>


Re: Spark-on-Yarn ClassNotFound Exception

2022-12-13 Thread Hariharan
I missed mentioning it above, but just to add: the error is coming from the
driver. I tried using *--driver-class-path /path/to/my/jar* as well, but no
luck.

Thanks!

On Mon, Dec 12, 2022 at 4:21 PM Hariharan  wrote:

> Hello folks,
>
> I have a spark app with a custom implementation of
> *fs.s3a.s3.client.factory.impl* which is packaged into the same jar.
> Output of *jar tf*
>
> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>
> However when I run the my spark app with spark-submit in cluster mode, it
> fails with the following error:
>
> *java.lang.RuntimeException: java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
> found*
>
> I tried:
> 1. passing in the jar to the *--jars* option (with the local path)
> 2. Passing in the jar to *spark.yarn.jars* option with an HDFS path
>
> but still the same error.
>
> Any suggestions on what I'm missing?
>
> Other pertinent details:
> Spark version: 3.3.0
> Hadoop version: 3.3.4
>
> Command used to run the app
> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
> --master yarn  --conf spark.executor.instances=6   /path/to/my/jar*
>
> TIA!
>


Spark-on-Yarn ClassNotFound Exception

2022-12-12 Thread Hariharan
Hello folks,

I have a spark app with a custom implementation of
*fs.s3a.s3.client.factory.impl* which is packaged into the same jar.
Output of *jar tf*

*2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*

However, when I run my Spark app with spark-submit in cluster mode, it fails
with the following error:

*java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
found*

I tried:
1. passing in the jar to the *--jars* option (with the local path)
2. Passing in the jar to *spark.yarn.jars* option with an HDFS path

but still the same error.

Any suggestions on what I'm missing?

Other pertinent details:
Spark version: 3.3.0
Hadoop version: 3.3.4

Command used to run the app
*/spark/bin/spark-submit --class MyMainClass --deploy-mode cluster --master
yarn  --conf spark.executor.instances=6   /path/to/my/jar*

TIA!


Re: Spark 3.0 yarn does not support cdh5

2019-10-21 Thread melin li
Many clusters still use CDH5 and we would like it to remain supported; CDH5
is based on Hadoop 2.6.

melin li  于2019年10月21日周一 下午3:02写道:

> Many clusters still use CDH5 and we hope it will continue to be supported; CDH5 is based on Hadoop 2.6.
>
> dev/make-distribution.sh --tgz -Pkubernetes -Pyarn -Phive-thriftserver
> -Phive -Dhadoop.version=2.6.0-cdh5.15.0 -DskipTest
>
> ```
> [INFO] Compiling 25 Scala sources to
> /Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/target/scala-2.12/classes
> ...
> [ERROR] [Error]
> /Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:298:
> value setRolledLogsIncludePattern is not a member of
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [ERROR] [Error]
> /Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:300:
> value setRolledLogsExcludePattern is not a member of
> org.apache.hadoop.yarn.api.records.LogAggregationContext
> [ERROR] [Error]
> /Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:551:
> not found: value isLocalUri
> [ERROR] [Error]
> /Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:1367:
> not found: value isLocalUri
> [ERROR] four errors found
> ```
>


Spark 3.0 yarn does not support cdh5

2019-10-21 Thread melin li
Many clusters still use CDH5 and we hope it will continue to be supported; CDH5 is based on Hadoop 2.6.

dev/make-distribution.sh --tgz -Pkubernetes -Pyarn -Phive-thriftserver
-Phive -Dhadoop.version=2.6.0-cdh5.15.0 -DskipTest

```
[INFO] Compiling 25 Scala sources to
/Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/target/scala-2.12/classes
...
[ERROR] [Error]
/Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:298:
value setRolledLogsIncludePattern is not a member of
org.apache.hadoop.yarn.api.records.LogAggregationContext
[ERROR] [Error]
/Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:300:
value setRolledLogsExcludePattern is not a member of
org.apache.hadoop.yarn.api.records.LogAggregationContext
[ERROR] [Error]
/Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:551:
not found: value isLocalUri
[ERROR] [Error]
/Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:1367:
not found: value isLocalUri
[ERROR] four errors found
```


Spark on YARN with private Docker repositories/registries

2019-08-16 Thread Tak-Lon (Stephen) Wu
Hi guys,

Has anyone been using Spark (spark-submit) in YARN mode pulling images
from a private Docker repository/registry?

How do you pass in the Docker config.json that contains the auth tokens?
Or is there an environment variable that can be added to the system
environment so that it is picked up by default?
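
For example, is it something along these lines? (Untested sketch — the
YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG variable name and the HDFS path
are assumptions to verify against the Hadoop YARN Docker-runtime docs.)

```scala
import org.apache.spark.SparkConf

// Untested sketch: point the YARN Docker runtime at a config.json (containing
// the registry auth) staged on HDFS, for both the AM and the executors.
val conf = new SparkConf()
  .set("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG",
    "hdfs:///user/stephen/docker-config.json")
  .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG",
    "hdfs:///user/stephen/docker-config.json")
```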

Thanks,
Stephen


Spark on Yarn - Dynamically getting a list of archives from --archives in spark-submit

2019-06-13 Thread Tommy Li
Hi

Is there any way to get a list of the archives submitted with a Spark job from
the Spark context?
I see that the Spark context has a `.files()` function which returns the files
included with `--files`, but I don't see an equivalent for `--archives`.
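
The closest workaround I can think of (untested, and it relies on the
YARN-specific property that --archives is translated into rather than a
public API; `spark` below is the active SparkSession) is something like:

```scala
// Untested sketch: on YARN, spark-submit --archives is recorded in
// spark.yarn.dist.archives, so it can be read back from the SparkConf.
val archives: Seq[String] = spark.sparkContext.getConf
  .getOption("spark.yarn.dist.archives")
  .map(_.split(",").toSeq.filter(_.nonEmpty))
  .getOrElse(Seq.empty)
archives.foreach(println)
```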


Thanks,
Tommy


Re: [spark on yarn] spark on yarn without DFS

2019-05-23 Thread Achilleus 003
This is interesting. Would really appreciate it if you could share what
exactly you changed in *core-site.xml* and *yarn-site.xml*.

On Wed, May 22, 2019 at 9:14 AM Gourav Sengupta 
wrote:

> just wondering what is the advantage of doing this?
>
> Regards
> Gourav Sengupta
>
> On Wed, May 22, 2019 at 3:01 AM Huizhe Wang 
> wrote:
>
>> Hi Hari,
>> Thanks :) I tried to do it as u said. It works ;)
>>
>>
>> Hariharan 于2019年5月20日 周一下午3:54写道:
>>
>>> Hi Huizhe,
>>>
>>> You can set the "fs.defaultFS" field in core-site.xml to some path on
>>> s3. That way your spark job will use S3 for all operations that need HDFS.
>>> Intermediate data will still be stored on local disk though.
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari <
>>> abdealikoth...@gmail.com> wrote:
>>>
>>>> While spark can read from S3 directly in EMR, I believe it still needs
>>>> the HDFS to perform shuffles and to write intermediate data into disk when
>>>> doing jobs (I.e. when the in memory need stop spill over to disk)
>>>>
>>>> For these operations, Spark does need a distributed file system - You
>>>> could use something like EMRFS (which is like a HDFS backed by S3) on
>>>> Amazon.
>>>>
>>>> The issue could be something else too - so a stacktrace or error
>>>> message could help in understanding the problem.
>>>>
>>>>
>>>>
>>>> On Mon, May 20, 2019, 07:20 Huizhe Wang 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS
>>>>> and using s3a to get them. However, when I use stop-dfs.sh stoped Namenode
>>>>> and DataNode. I got an error when using yarn cluster mode. Could I using
>>>>> yarn without start DFS, how could I use this mode?
>>>>>
>>>>> Yours,
>>>>> Jane
>>>>>
>>>>


Re: [spark on yarn] spark on yarn without DFS

2019-05-22 Thread Gourav Sengupta
just wondering what is the advantage of doing this?

Regards
Gourav Sengupta

On Wed, May 22, 2019 at 3:01 AM Huizhe Wang  wrote:

> Hi Hari,
> Thanks :) I tried to do it as u said. It works ;)
>
>
> Hariharan 于2019年5月20日 周一下午3:54写道:
>
>> Hi Huizhe,
>>
>> You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
>> That way your spark job will use S3 for all operations that need HDFS.
>> Intermediate data will still be stored on local disk though.
>>
>> Thanks,
>> Hari
>>
>> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari <
>> abdealikoth...@gmail.com> wrote:
>>
>>> While spark can read from S3 directly in EMR, I believe it still needs
>>> the HDFS to perform shuffles and to write intermediate data into disk when
>>> doing jobs (I.e. when the in memory need stop spill over to disk)
>>>
>>> For these operations, Spark does need a distributed file system - You
>>> could use something like EMRFS (which is like a HDFS backed by S3) on
>>> Amazon.
>>>
>>> The issue could be something else too - so a stacktrace or error message
>>> could help in understanding the problem.
>>>
>>>
>>>
>>> On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:
>>>
>>>> Hi,
>>>>
>>>> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS
>>>> and using s3a to get them. However, when I use stop-dfs.sh stoped Namenode
>>>> and DataNode. I got an error when using yarn cluster mode. Could I using
>>>> yarn without start DFS, how could I use this mode?
>>>>
>>>> Yours,
>>>> Jane
>>>>
>>>


Re: [spark on yarn] spark on yarn without DFS

2019-05-21 Thread Huizhe Wang
Hi Hari,
Thanks :) I tried to do it as u said. It works ;)


Hariharan 于2019年5月20日 周一下午3:54写道:

> Hi Huizhe,
>
> You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
> That way your spark job will use S3 for all operations that need HDFS.
> Intermediate data will still be stored on local disk though.
>
> Thanks,
> Hari
>
> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari 
> wrote:
>
>> While spark can read from S3 directly in EMR, I believe it still needs
>> the HDFS to perform shuffles and to write intermediate data into disk when
>> doing jobs (I.e. when the in memory need stop spill over to disk)
>>
>> For these operations, Spark does need a distributed file system - You
>> could use something like EMRFS (which is like a HDFS backed by S3) on
>> Amazon.
>>
>> The issue could be something else too - so a stacktrace or error message
>> could help in understanding the problem.
>>
>>
>>
>> On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:
>>
>>> Hi,
>>>
>>> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS and
>>> using s3a to get them. However, when I use stop-dfs.sh stoped Namenode and
>>> DataNode. I got an error when using yarn cluster mode. Could I using yarn
>>> without start DFS, how could I use this mode?
>>>
>>> Yours,
>>> Jane
>>>
>>


Re: [spark on yarn] spark on yarn without DFS

2019-05-20 Thread JB Data31
There is a kind of check in the *yarn-site.xml*


<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/var/yarn/logs</value>
</property>

Using *hdfs://:9000* as *fs.defaultFS* in *core-site.xml*, you have to run
*hdfs dfs -mkdir /var/yarn/logs*.
Using *S3://* as *fs.defaultFS*...

Take care of the *.dir* properties in *hdfs-site.xml*; they must point to a
local or S3 location.

Curious to see *YARN* working without *DFS*.

@*JB*Δ <http://jbigdata.fr/jbigdata/hadoop.html>

Le lun. 20 mai 2019 à 09:54, Hariharan  a écrit :

> Hi Huizhe,
>
> You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
> That way your spark job will use S3 for all operations that need HDFS.
> Intermediate data will still be stored on local disk though.
>
> Thanks,
> Hari
>
> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari 
> wrote:
>
>> While spark can read from S3 directly in EMR, I believe it still needs
>> the HDFS to perform shuffles and to write intermediate data into disk when
>> doing jobs (I.e. when the in memory need stop spill over to disk)
>>
>> For these operations, Spark does need a distributed file system - You
>> could use something like EMRFS (which is like a HDFS backed by S3) on
>> Amazon.
>>
>> The issue could be something else too - so a stacktrace or error message
>> could help in understanding the problem.
>>
>>
>>
>> On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:
>>
>>> Hi,
>>>
>>> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS and
>>> using s3a to get them. However, when I use stop-dfs.sh stoped Namenode and
>>> DataNode. I got an error when using yarn cluster mode. Could I using yarn
>>> without start DFS, how could I use this mode?
>>>
>>> Yours,
>>> Jane
>>>
>>


Re: [spark on yarn] spark on yarn without DFS

2019-05-20 Thread Hariharan
Hi Huizhe,

You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
That way your spark job will use S3 for all operations that need HDFS.
Intermediate data will still be stored on local disk though.
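
For example, a minimal core-site.xml sketch (the bucket name is a
placeholder, and you would still need the usual s3a credential/endpoint
properties alongside it):

<property>
  <name>fs.defaultFS</name>
  <value>s3a://my-bucket</value>
</property>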

Thanks,
Hari

On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari 
wrote:

> While spark can read from S3 directly in EMR, I believe it still needs the
> HDFS to perform shuffles and to write intermediate data into disk when
> doing jobs (I.e. when the in memory need stop spill over to disk)
>
> For these operations, Spark does need a distributed file system - You
> could use something like EMRFS (which is like a HDFS backed by S3) on
> Amazon.
>
> The issue could be something else too - so a stacktrace or error message
> could help in understanding the problem.
>
>
>
> On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:
>
>> Hi,
>>
>> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS and
>> using s3a to get them. However, when I use stop-dfs.sh stoped Namenode and
>> DataNode. I got an error when using yarn cluster mode. Could I using yarn
>> without start DFS, how could I use this mode?
>>
>> Yours,
>> Jane
>>
>


Re: [spark on yarn] spark on yarn without DFS

2019-05-19 Thread Abdeali Kothari
While Spark can read from S3 directly in EMR, I believe it still needs HDFS
to perform shuffles and to write intermediate data to disk during jobs (i.e.
when in-memory data needs to spill over to disk).

For these operations, Spark does need a distributed file system - You could
use something like EMRFS (which is like a HDFS backed by S3) on Amazon.

The issue could be something else too - so a stacktrace or error message
could help in understanding the problem.



On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:

> Hi,
>
> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS and
> using s3a to get them. However, when I use stop-dfs.sh stoped Namenode and
> DataNode. I got an error when using yarn cluster mode. Could I using yarn
> without start DFS, how could I use this mode?
>
> Yours,
> Jane
>


Re: [spark on yarn] spark on yarn without DFS

2019-05-19 Thread Jeff Zhang
I am afraid not, because yarn needs dfs.

Huizhe Wang  于2019年5月20日周一 上午9:50写道:

> Hi,
>
> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS and
> using s3a to get them. However, when I use stop-dfs.sh stoped Namenode and
> DataNode. I got an error when using yarn cluster mode. Could I using yarn
> without start DFS, how could I use this mode?
>
> Yours,
> Jane
>


-- 
Best Regards

Jeff Zhang


[spark on yarn] spark on yarn without DFS

2019-05-19 Thread Huizhe Wang
Hi,

I want to use Spark on YARN without HDFS. I store my resources in AWS and
use s3a to get them. However, when I ran stop-dfs.sh it stopped the NameNode
and DataNode, and I then got an error when using yarn cluster mode. Can I use
YARN without starting DFS, and how would I use this mode?

Yours,
Jane


Re: How to configure alluxio cluster with spark in yarn

2019-05-16 Thread Bin Fan
hi Andy

Assuming you are running Spark with YARN, then I would recommend deploying
Alluxio in the same YARN cluster if you are looking for best performance.
Alluxio can also be deployed separated as a standalone service, but in that
case, you may need to transfer data from Alluxio cluster to your Spark/YARN
cluster.

Here is the documentation
<https://docs.alluxio.io/os/user/1.8/en/deploy/Running-Alluxio-On-Yarn.html?utm_source=spark>
about
deploying Alluxio with YARN.

- Bin

On Thu, May 9, 2019 at 4:19 AM u9g  wrote:

> Hey,
>
> I want to speed up the Spark task running in the Yarn cluster through
> Alluxio. Is Alluxio recommended to run in the same yarn cluster on the yarn
> mode? Should I deploy Alluxio independently on the nodes of the yarn
> cluster? Or deploy a cluster separately?
> Best,
> Andy Li
>
>
>
>


Re: Spark on yarn - application hangs

2019-05-10 Thread Mich Talebzadeh
sure NP.

I meant these topics

[image: image.png]

Have a look at this article of mine

https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/


under section

Understanding the Spark Application Through Visualization

See if it helps

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 10 May 2019 at 18:10, Mkal  wrote:

> How can i check what exactly is stagnant? Do you mean on the DAG
> visualization on Spark UI?
>
> Sorry i'm new to spark.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark on yarn - application hangs

2019-05-10 Thread Mkal
How can I check what exactly is stagnant? Do you mean on the DAG
visualization in the Spark UI?

Sorry, I'm new to Spark.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on yarn - application hangs

2019-05-10 Thread Mich Talebzadeh
Hi,

Have you checked the metrics from the Spark UI by any chance? What is stagnant?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 10 May 2019 at 17:51, Mkal  wrote:

> I've built a spark job in which an external program is called through the
> use
> of pipe().
> Job runs correctly on cluster when the input is a small sample dataset but
> when the input is a real large dataset it stays on RUNNING state forever.
>
> I've tried different ways to tune executor memory, executor cores, overhead
> memory but havent found a solution so far.
> I've also tried to force external program to use only 1 thread in case
> there
> is a problem due to it being a multithread application but nothing.
>
> Any suggestion would be welcome
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark on yarn - application hangs

2019-05-10 Thread Mkal
I've built a Spark job in which an external program is called through the use
of pipe() (essentially the pattern sketched below).
The job runs correctly on the cluster when the input is a small sample dataset,
but when the input is a real, large dataset it stays in the RUNNING state forever.
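
For context, the job is essentially this pattern (the paths and the program
name are placeholders, and `sc` is the SparkContext):

```scala
// Placeholder paths/command: each partition's records are piped through an
// external program and its stdout is saved back to HDFS.
val piped = sc.textFile("hdfs:///data/input")
  .pipe(Seq("/opt/tools/external_program", "--threads", "1"))
piped.saveAsTextFile("hdfs:///data/output")
```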

I've tried different ways to tune executor memory, executor cores, and overhead
memory, but haven't found a solution so far.
I've also tried forcing the external program to use only 1 thread in case there
is a problem due to it being a multithreaded application, but nothing.

Any suggestion would be welcome



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to configure alluxio cluster with spark in yarn

2019-05-09 Thread u9g
Hey, 

I want to speed up the Spark tasks running in the YARN cluster through Alluxio.
Is it recommended to run Alluxio in the same YARN cluster when using YARN mode?
Should I deploy Alluxio independently on the nodes of the YARN cluster, or
deploy a separate cluster?

Best,
Andy Li

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-12 Thread Vadim Semenov
Yeah, then the easiest would be to fork spark and run using the forked
version, and in case of YARN it should be pretty easy to do.

git clone https://github.com/apache/spark.git

cd spark

export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=512m"

./build/mvn -DskipTests clean package

./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.7 -Phive
-Pyarn

ls -la spark-2.4.0-SNAPSHOT-bin-custom-spark.tgz

scp spark-2.4.0-SNAPSHOT-bin-custom-spark.tgz cluster:/tmp

export SPARK_HOME="/tmp/spark-2.4.0-SNAPSHOT-bin-custom-spark"

cd $SPARK_HOME
mv conf conf.new
ln -s /etc/spark/conf conf

echo $SPARK_HOME
spark-submit --version

On Tue, Feb 12, 2019 at 6:40 AM Serega Sheypak 
wrote:
>
> I tried a similar approach, it works well for user functions. but I need
to crash tasks or executor when spark application runs "repartition". I
didn't any away to inject "poison pill" into repartition call :(
>
> пн, 11 февр. 2019 г. в 21:19, Vadim Semenov :
>>
>> something like this
>>
>> import org.apache.spark.TaskContext
>> ds.map(r => {
>>   val taskContext = TaskContext.get()
>>   if (taskContext.partitionId == 1000) {
>> throw new RuntimeException
>>   }
>>   r
>> })
>>
>> On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak 
wrote:
>> >
>> > I need to crash task which does repartition.
>> >
>> > пн, 11 февр. 2019 г. в 10:37, Gabor Somogyi :
>> >>
>> >> What blocks you to put if conditions inside the mentioned map
function?
>> >>
>> >> On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak <
serega.shey...@gmail.com> wrote:
>> >>>
>> >>> Yeah, but I don't need to crash entire app, I want to fail several
tasks or executors and then wait for completion.
>> >>>
>> >>> вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi <
gabor.g.somo...@gmail.com>:
>> 
>>  Another approach is adding artificial exception into the
application's source code like this:
>> 
>>  val query = input.toDS.map(_ /
0).writeStream.format("console").start()
>> 
>>  G
>> 
>> 
>>  On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak <
serega.shey...@gmail.com> wrote:
>> >
>> > Hi BR,
>> > thanks for your reply. I want to mimic the issue and kill tasks at
a certain stage. Killing executor is also an option for me.
>> > I'm curious how do core spark contributors test spark fault
tolerance?
>> >
>> >
>> > вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi <
gabor.g.somo...@gmail.com>:
>> >>
>> >> Hi Serega,
>> >>
>> >> If I understand your problem correctly you would like to kill one
executor only and the rest of the app has to be untouched.
>> >> If that's true yarn -kill is not what you want because it stops
the whole application.
>> >>
>> >> I've done similar thing when tested/testing Spark's HA features.
>> >> - jps -vlm | grep
"org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
>> >> - kill -9 pidofoneexecutor
>> >>
>> >> Be aware if it's a multi-node cluster check whether at least one
process runs on a specific node(it's not required).
>> >> Happy killing...
>> >>
>> >> BR,
>> >> G
>> >>
>> >>
>> >> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke 
wrote:
>> >>>
>> >>> yarn application -kill applicationid ?
>> >>>
>> >>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak <
serega.shey...@gmail.com>:
>> >>> >
>> >>> > Hi there!
>> >>> > I have weird issue that appears only when tasks fail at
specific stage. I would like to imitate failure on my own.
>> >>> > The plan is to run problematic app and then kill entire
executor or some tasks when execution reaches certain stage.
>> >>> >
>> >>> > Is it do-able?
>> >>>
>> >>>
-
>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>>
>>
>>
>> --
>> Sent from my iPhone



-- 
Sent from my iPhone


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-12 Thread Serega Sheypak
I tried a similar approach; it works well for user functions, but I need to
crash tasks or executors when the Spark application runs "repartition". I didn't
find any way to inject a "poison pill" into the repartition call :(

пн, 11 февр. 2019 г. в 21:19, Vadim Semenov :

> something like this
>
> import org.apache.spark.TaskContext
> ds.map(r => {
>   val taskContext = TaskContext.get()
>   if (taskContext.partitionId == 1000) {
> throw new RuntimeException
>   }
>   r
> })
>
> On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak 
> wrote:
> >
> > I need to crash task which does repartition.
> >
> > пн, 11 февр. 2019 г. в 10:37, Gabor Somogyi :
> >>
> >> What blocks you to put if conditions inside the mentioned map function?
> >>
> >> On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak <
> serega.shey...@gmail.com> wrote:
> >>>
> >>> Yeah, but I don't need to crash entire app, I want to fail several
> tasks or executors and then wait for completion.
> >>>
> >>> вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi  >:
> 
>  Another approach is adding artificial exception into the
> application's source code like this:
> 
>  val query = input.toDS.map(_ /
> 0).writeStream.format("console").start()
> 
>  G
> 
> 
>  On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak <
> serega.shey...@gmail.com> wrote:
> >
> > Hi BR,
> > thanks for your reply. I want to mimic the issue and kill tasks at a
> certain stage. Killing executor is also an option for me.
> > I'm curious how do core spark contributors test spark fault
> tolerance?
> >
> >
> > вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi <
> gabor.g.somo...@gmail.com>:
> >>
> >> Hi Serega,
> >>
> >> If I understand your problem correctly you would like to kill one
> executor only and the rest of the app has to be untouched.
> >> If that's true yarn -kill is not what you want because it stops the
> whole application.
> >>
> >> I've done similar thing when tested/testing Spark's HA features.
> >> - jps -vlm | grep
> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
> >> - kill -9 pidofoneexecutor
> >>
> >> Be aware if it's a multi-node cluster check whether at least one
> process runs on a specific node(it's not required).
> >> Happy killing...
> >>
> >> BR,
> >> G
> >>
> >>
> >> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke 
> wrote:
> >>>
> >>> yarn application -kill applicationid ?
> >>>
> >>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak <
> serega.shey...@gmail.com>:
> >>> >
> >>> > Hi there!
> >>> > I have weird issue that appears only when tasks fail at specific
> stage. I would like to imitate failure on my own.
> >>> > The plan is to run problematic app and then kill entire executor
> or some tasks when execution reaches certain stage.
> >>> >
> >>> > Is it do-able?
> >>>
> >>>
> -
> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>>
>
>
> --
> Sent from my iPhone
>


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Vadim Semenov
something like this

import org.apache.spark.TaskContext
ds.map(r => {
  val taskContext = TaskContext.get()
  if (taskContext.partitionId == 1000) {
throw new RuntimeException
  }
  r
})

On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak  wrote:
>
> I need to crash task which does repartition.
>
> пн, 11 февр. 2019 г. в 10:37, Gabor Somogyi :
>>
>> What blocks you to put if conditions inside the mentioned map function?
>>
>> On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak  
>> wrote:
>>>
>>> Yeah, but I don't need to crash entire app, I want to fail several tasks or 
>>> executors and then wait for completion.
>>>
>>> вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi :

 Another approach is adding artificial exception into the application's 
 source code like this:

 val query = input.toDS.map(_ / 0).writeStream.format("console").start()

 G


 On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak  
 wrote:
>
> Hi BR,
> thanks for your reply. I want to mimic the issue and kill tasks at a 
> certain stage. Killing executor is also an option for me.
> I'm curious how do core spark contributors test spark fault tolerance?
>
>
> вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi :
>>
>> Hi Serega,
>>
>> If I understand your problem correctly you would like to kill one 
>> executor only and the rest of the app has to be untouched.
>> If that's true yarn -kill is not what you want because it stops the 
>> whole application.
>>
>> I've done similar thing when tested/testing Spark's HA features.
>> - jps -vlm | grep 
>> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
>> - kill -9 pidofoneexecutor
>>
>> Be aware if it's a multi-node cluster check whether at least one process 
>> runs on a specific node(it's not required).
>> Happy killing...
>>
>> BR,
>> G
>>
>>
>> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke  wrote:
>>>
>>> yarn application -kill applicationid ?
>>>
>>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak 
>>> > :
>>> >
>>> > Hi there!
>>> > I have weird issue that appears only when tasks fail at specific 
>>> > stage. I would like to imitate failure on my own.
>>> > The plan is to run problematic app and then kill entire executor or 
>>> > some tasks when execution reaches certain stage.
>>> >
>>> > Is it do-able?
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>


-- 
Sent from my iPhone

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Serega Sheypak
I need to crash a task which does the repartition.

Mon, 11 Feb 2019 at 10:37, Gabor Somogyi :

> What blocks you to put if conditions inside the mentioned map function?
>
> On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak 
> wrote:
>
>> Yeah, but I don't need to crash entire app, I want to fail several tasks
>> or executors and then wait for completion.
>>
>> вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi :
>>
>>> Another approach is adding artificial exception into the application's
>>> source code like this:
>>>
>>> val query = input.toDS.map(_ / 0).writeStream.format("console").start()
>>>
>>> G
>>>
>>>
>>> On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak 
>>> wrote:
>>>
 Hi BR,
 thanks for your reply. I want to mimic the issue and kill tasks at a
 certain stage. Killing executor is also an option for me.
 I'm curious how do core spark contributors test spark fault tolerance?


 вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi >>> >:

> Hi Serega,
>
> If I understand your problem correctly you would like to kill one
> executor only and the rest of the app has to be untouched.
> If that's true yarn -kill is not what you want because it stops the
> whole application.
>
> I've done similar thing when tested/testing Spark's HA features.
> - jps -vlm | grep
> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
> - kill -9 pidofoneexecutor
>
> Be aware if it's a multi-node cluster check whether at least one
> process runs on a specific node(it's not required).
> Happy killing...
>
> BR,
> G
>
>
> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke 
> wrote:
>
>> yarn application -kill applicationid ?
>>
>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak <
>> serega.shey...@gmail.com>:
>> >
>> > Hi there!
>> > I have weird issue that appears only when tasks fail at specific
>> stage. I would like to imitate failure on my own.
>> > The plan is to run problematic app and then kill entire executor or
>> some tasks when execution reaches certain stage.
>> >
>> > Is it do-able?
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Gabor Somogyi
What blocks you from putting if conditions inside the mentioned map function?

On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak 
wrote:

> Yeah, but I don't need to crash entire app, I want to fail several tasks
> or executors and then wait for completion.
>
> вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi :
>
>> Another approach is adding artificial exception into the application's
>> source code like this:
>>
>> val query = input.toDS.map(_ / 0).writeStream.format("console").start()
>>
>> G
>>
>>
>> On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak 
>> wrote:
>>
>>> Hi BR,
>>> thanks for your reply. I want to mimic the issue and kill tasks at a
>>> certain stage. Killing executor is also an option for me.
>>> I'm curious how do core spark contributors test spark fault tolerance?
>>>
>>>
>>> вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi :
>>>
 Hi Serega,

 If I understand your problem correctly you would like to kill one
 executor only and the rest of the app has to be untouched.
 If that's true yarn -kill is not what you want because it stops the
 whole application.

 I've done similar thing when tested/testing Spark's HA features.
 - jps -vlm | grep
 "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
 - kill -9 pidofoneexecutor

 Be aware if it's a multi-node cluster check whether at least one
 process runs on a specific node(it's not required).
 Happy killing...

 BR,
 G


 On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke 
 wrote:

> yarn application -kill applicationid ?
>
> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak <
> serega.shey...@gmail.com>:
> >
> > Hi there!
> > I have weird issue that appears only when tasks fail at specific
> stage. I would like to imitate failure on my own.
> > The plan is to run problematic app and then kill entire executor or
> some tasks when execution reaches certain stage.
> >
> > Is it do-able?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Serega Sheypak
Yeah, but I don't need to crash the entire app; I want to fail several tasks or
executors and then wait for completion.

Sun, 10 Feb 2019 at 21:49, Gabor Somogyi :

> Another approach is adding artificial exception into the application's
> source code like this:
>
> val query = input.toDS.map(_ / 0).writeStream.format("console").start()
>
> G
>
>
> On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak 
> wrote:
>
>> Hi BR,
>> thanks for your reply. I want to mimic the issue and kill tasks at a
>> certain stage. Killing executor is also an option for me.
>> I'm curious how do core spark contributors test spark fault tolerance?
>>
>>
>> вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi :
>>
>>> Hi Serega,
>>>
>>> If I understand your problem correctly you would like to kill one
>>> executor only and the rest of the app has to be untouched.
>>> If that's true yarn -kill is not what you want because it stops the
>>> whole application.
>>>
>>> I've done similar thing when tested/testing Spark's HA features.
>>> - jps -vlm | grep
>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
>>> - kill -9 pidofoneexecutor
>>>
>>> Be aware if it's a multi-node cluster check whether at least one process
>>> runs on a specific node(it's not required).
>>> Happy killing...
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke 
>>> wrote:
>>>
 yarn application -kill applicationid ?

 > Am 10.02.2019 um 13:30 schrieb Serega Sheypak <
 serega.shey...@gmail.com>:
 >
 > Hi there!
 > I have weird issue that appears only when tasks fail at specific
 stage. I would like to imitate failure on my own.
 > The plan is to run problematic app and then kill entire executor or
 some tasks when execution reaches certain stage.
 >
 > Is it do-able?

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org




Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Gabor Somogyi
Another approach is adding an artificial exception into the application's
source code, like this:

val query = input.toDS.map(_ / 0).writeStream.format("console").start()

G


On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak 
wrote:

> Hi BR,
> thanks for your reply. I want to mimic the issue and kill tasks at a
> certain stage. Killing executor is also an option for me.
> I'm curious how do core spark contributors test spark fault tolerance?
>
>
> вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi :
>
>> Hi Serega,
>>
>> If I understand your problem correctly you would like to kill one
>> executor only and the rest of the app has to be untouched.
>> If that's true yarn -kill is not what you want because it stops the whole
>> application.
>>
>> I've done similar thing when tested/testing Spark's HA features.
>> - jps -vlm | grep
>> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
>> - kill -9 pidofoneexecutor
>>
>> Be aware if it's a multi-node cluster check whether at least one process
>> runs on a specific node(it's not required).
>> Happy killing...
>>
>> BR,
>> G
>>
>>
>> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke  wrote:
>>
>>> yarn application -kill applicationid ?
>>>
>>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak <
>>> serega.shey...@gmail.com>:
>>> >
>>> > Hi there!
>>> > I have weird issue that appears only when tasks fail at specific
>>> stage. I would like to imitate failure on my own.
>>> > The plan is to run problematic app and then kill entire executor or
>>> some tasks when execution reaches certain stage.
>>> >
>>> > Is it do-able?
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Serega Sheypak
Hi BR,
thanks for your reply. I want to mimic the issue and kill tasks at a
certain stage. Killing an executor is also an option for me.
I'm curious how core Spark contributors test Spark's fault tolerance.


Sun, 10 Feb 2019 at 16:57, Gabor Somogyi :

> Hi Serega,
>
> If I understand your problem correctly you would like to kill one executor
> only and the rest of the app has to be untouched.
> If that's true yarn -kill is not what you want because it stops the whole
> application.
>
> I've done similar thing when tested/testing Spark's HA features.
> - jps -vlm | grep
> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
> - kill -9 pidofoneexecutor
>
> Be aware if it's a multi-node cluster check whether at least one process
> runs on a specific node(it's not required).
> Happy killing...
>
> BR,
> G
>
>
> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke  wrote:
>
>> yarn application -kill applicationid ?
>>
>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak > >:
>> >
>> > Hi there!
>> > I have weird issue that appears only when tasks fail at specific stage.
>> I would like to imitate failure on my own.
>> > The plan is to run problematic app and then kill entire executor or
>> some tasks when execution reaches certain stage.
>> >
>> > Is it do-able?
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Gabor Somogyi
Hi Serega,

If I understand your problem correctly, you would like to kill one executor
only and leave the rest of the app untouched.
If that's true, yarn -kill is not what you want, because it stops the whole
application.

I've done a similar thing when testing Spark's HA features:
- jps -vlm | grep
"org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
- kill -9 pidofoneexecutor

Be aware that on a multi-node cluster you should check whether at least one such
process actually runs on the specific node (it's not required to).
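
As an in-application alternative, here is a small sketch using SparkContext's
developer API (assuming Spark 1.4+ and a coarse-grained backend such as YARN;
"1" is just an example executor ID taken from the Spark UI "Executors" tab, and
whether the killed executor gets replaced depends on the Spark version and on
dynamic allocation settings):

```scala
// Ask the cluster manager to kill a specific executor from inside the app.
val sc = spark.sparkContext               // assumes an existing SparkSession `spark`
val accepted = sc.killExecutors(Seq("1")) // developer API, returns true if the request was sent
println(s"kill request accepted: $accepted")
```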
Happy killing...

BR,
G


On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke  wrote:

> yarn application -kill applicationid ?
>
> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak  >:
> >
> > Hi there!
> > I have weird issue that appears only when tasks fail at specific stage.
> I would like to imitate failure on my own.
> > The plan is to run problematic app and then kill entire executor or some
> tasks when execution reaches certain stage.
> >
> > Is it do-able?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Jörn Franke
yarn application -kill applicationid ?

> Am 10.02.2019 um 13:30 schrieb Serega Sheypak :
> 
> Hi there!
> I have weird issue that appears only when tasks fail at specific stage. I 
> would like to imitate failure on my own. 
> The plan is to run problematic app and then kill entire executor or some 
> tasks when execution reaches certain stage.
> 
> Is it do-able? 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Serega Sheypak
Hi there!
I have a weird issue that appears only when tasks fail at a
specific stage. I would like to imitate the failure on my own.
The plan is to run the problematic app and then kill an entire
executor, or some tasks, when execution reaches a certain stage.

Is it doable?


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-23 Thread Serega Sheypak
Hi Imran,
here is my use case:
There is a 1K-node cluster, and jobs have performance degradation because of
a single node. It's rather hard to convince Cluster Ops to decommission the
node because of "performance degradation". Imagine 10 dev teams chasing a
single ops team for a valid reason (the node has problems), or because code has a
bug, or data is skewed, or spots on the sun. We can't just decommission a node
because a random dev complains.

Simple solution:
- Rerun the failed / delayed job and blacklist the "problematic" node in advance.
- Report the problem if the job works without anomalies.
- Ops collect complaints about the node and start to decommission it when the
"complaints threshold" is reached. It's a rather low probability that many
loosely coupled teams with loosely coupled jobs complain about a single
node.


Results:
- Ops are not spammed with random requests from devs.
- Devs are not blocked because of a really bad node.
- It's very cheap for everyone to "blacklist" a node during job submission
without doing anything to the node.
- It's very easy to automate such behavior. Many teams use all kinds of
workflow runners, and the strategy is dead simple (it depends on the SLA, of
course):
  - Just re-run a failed job, excluding the nodes with failed tasks (if the number of
nodes is reasonable).
  - Kill a stuck job if it runs longer than XXX minutes and re-start it,
excluding the nodes with long-running tasks.



Wed, 23 Jan 2019 at 23:09, Imran Rashid :

> Serga, can you explain a bit more why you want this ability?
> If the node is really bad, wouldn't you want to decomission the NM
> entirely?
> If you've got heterogenous resources, than nodelabels seem like they would
> be more appropriate -- and I don't feel great about adding workarounds for
> the node-label limitations into blacklisting.
>
> I don't want to be stuck supporting a configuration with too limited a use
> case.
>
> (may be better to move discussion to
> https://issues.apache.org/jira/browse/SPARK-26688 so its better archived,
> I'm responding here in case you aren't watching that issue)
>
> On Tue, Jan 22, 2019 at 6:09 AM Jörn Franke  wrote:
>
>> You can try with Yarn node labels:
>>
>> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
>>
>> Then you can whitelist nodes.
>>
>> Am 19.01.2019 um 00:20 schrieb Serega Sheypak :
>>
>> Hi, is there any possibility to tell Scheduler to blacklist specific
>> nodes in advance?
>>
>>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-23 Thread Imran Rashid
Serga, can you explain a bit more why you want this ability?
If the node is really bad, wouldn't you want to decommission the NM entirely?
If you've got heterogeneous resources, then node labels seem like they would
be more appropriate -- and I don't feel great about adding workarounds for
the node-label limitations into blacklisting.

I don't want to be stuck supporting a configuration with too limited a use
case.

(It may be better to move the discussion to
https://issues.apache.org/jira/browse/SPARK-26688 so it's better archived;
I'm responding here in case you aren't watching that issue.)

On Tue, Jan 22, 2019 at 6:09 AM Jörn Franke  wrote:

> You can try with Yarn node labels:
>
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
>
> Then you can whitelist nodes.
>
> Am 19.01.2019 um 00:20 schrieb Serega Sheypak :
>
> Hi, is there any possibility to tell Scheduler to blacklist specific nodes
> in advance?
>
>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-22 Thread Jörn Franke
You can try with Yarn node labels:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html

Then you can whitelist nodes.
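
For illustration, a minimal sketch of how the node-label expressions are set on
the Spark side (the "stable" label is made up for this example and would have to
exist and be assigned to nodes in YARN already):

```scala
import org.apache.spark.SparkConf

// Request AM and executor containers only on nodes carrying the "stable" label.
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "stable")
  .set("spark.yarn.executor.nodeLabelExpression", "stable")
```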

> Am 19.01.2019 um 00:20 schrieb Serega Sheypak :
> 
> Hi, is there any possibility to tell Scheduler to blacklist specific nodes in 
> advance?


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-22 Thread Attila Zsolt Piros
The new issue is https://issues.apache.org/jira/browse/SPARK-26688.


On Tue, Jan 22, 2019 at 11:30 AM Attila Zsolt Piros 
wrote:

> Hi,
>
> >> Is it this one: https://github.com/apache/spark/pull/23223 ?
>
> No. My old development was https://github.com/apache/spark/pull/21068,
> which is closed.
>
> This would be a new improvement with a new Apache JIRA issue (
> https://issues.apache.org) and with a new Github pull request.
>
> >> Can I try to reach you through Cloudera Support portal?
>
> It is not needed. This would be an improvement into the Apache Spark which
> details can be discussed in the JIRA / Github PR.
>
> Attila
>
>
> On Mon, Jan 21, 2019 at 10:18 PM Serega Sheypak 
> wrote:
>
>> Hi Apiros, thanks for your reply.
>>
>> Is it this one: https://github.com/apache/spark/pull/23223 ?
>> Can I try to reach you through Cloudera Support portal?
>>
>> пн, 21 янв. 2019 г. в 20:06, attilapiros :
>>
>>> Hello, I was working on this area last year (I have developed the
>>> YarnAllocatorBlacklistTracker) and if you haven't found any solution for
>>> your problem I can introduce a new config which would contain a sequence
>>> of
>>> always blacklisted nodes. This way blacklisting would improve a bit
>>> again :)
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-22 Thread Attila Zsolt Piros
Hi,

>> Is it this one: https://github.com/apache/spark/pull/23223 ?

No. My old development was https://github.com/apache/spark/pull/21068,
which is closed.

This would be a new improvement with a new Apache JIRA issue (
https://issues.apache.org) and with a new Github pull request.

>> Can I try to reach you through Cloudera Support portal?

It is not needed. This would be an improvement to Apache Spark, whose
details can be discussed in the JIRA / GitHub PR.

Attila


On Mon, Jan 21, 2019 at 10:18 PM Serega Sheypak 
wrote:

> Hi Apiros, thanks for your reply.
>
> Is it this one: https://github.com/apache/spark/pull/23223 ?
> Can I try to reach you through Cloudera Support portal?
>
> пн, 21 янв. 2019 г. в 20:06, attilapiros :
>
>> Hello, I was working on this area last year (I have developed the
>> YarnAllocatorBlacklistTracker) and if you haven't found any solution for
>> your problem I can introduce a new config which would contain a sequence
>> of
>> always blacklisted nodes. This way blacklisting would improve a bit again
>> :)
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-21 Thread Serega Sheypak
Hi Apiros, thanks for your reply.

Is it this one: https://github.com/apache/spark/pull/23223 ?
Can I try to reach you through Cloudera Support portal?

Mon, 21 Jan 2019 at 20:06, attilapiros :

> Hello, I was working on this area last year (I have developed the
> YarnAllocatorBlacklistTracker) and if you haven't found any solution for
> your problem I can introduce a new config which would contain a sequence of
> always blacklisted nodes. This way blacklisting would improve a bit again
> :)
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-21 Thread attilapiros
Hello, I was working in this area last year (I developed the
YarnAllocatorBlacklistTracker), and if you haven't found any solution for
your problem I can introduce a new config which would contain a sequence of
always-blacklisted nodes. This way blacklisting would improve a bit again :)



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-20 Thread Serega Sheypak
Thanks, so I'll check YARN.
Does anyone know if Spark-on-Yarn plans to expose such functionality?

Sat, 19 Jan 2019 at 18:04, Felix Cheung :

> To clarify, yarn actually supports excluding node right when requesting
> resources. It’s spark that doesn’t provide a way to populate such a
> blacklist.
>
> If you can change yarn config, the equivalent is node label:
> https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
>
>
>
> --
> *From:* Li Gao 
> *Sent:* Saturday, January 19, 2019 8:43 AM
> *To:* Felix Cheung
> *Cc:* Serega Sheypak; user
> *Subject:* Re: Spark on Yarn, is it possible to manually blacklist nodes
> before running spark job?
>
> on yarn it is impossible afaik. on kubernetes you can use taints to keep
> certain nodes outside of spark
>
> On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung 
> wrote:
>
>> Not as far as I recall...
>>
>>
>> --
>> *From:* Serega Sheypak 
>> *Sent:* Friday, January 18, 2019 3:21 PM
>> *To:* user
>> *Subject:* Spark on Yarn, is it possible to manually blacklist nodes
>> before running spark job?
>>
>> Hi, is there any possibility to tell Scheduler to blacklist specific
>> nodes in advance?
>>
>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Felix Cheung
To clarify, YARN actually supports excluding nodes right when requesting
resources. It's Spark that doesn't provide a way to populate such a blacklist.

If you can change yarn config, the equivalent is node label: 
https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html




From: Li Gao 
Sent: Saturday, January 19, 2019 8:43 AM
To: Felix Cheung
Cc: Serega Sheypak; user
Subject: Re: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

on yarn it is impossible afaik. on kubernetes you can use taints to keep 
certain nodes outside of spark

On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
Not as far as I recall...



From: Serega Sheypak <serega.shey...@gmail.com>
Sent: Friday, January 18, 2019 3:21 PM
To: user
Subject: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

Hi, is there any possibility to tell Scheduler to blacklist specific nodes in 
advance?


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Li Gao
On YARN it is impossible, AFAIK. On Kubernetes you can use taints to keep
certain nodes outside of Spark.

On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung 
wrote:

> Not as far as I recall...
>
>
> --
> *From:* Serega Sheypak 
> *Sent:* Friday, January 18, 2019 3:21 PM
> *To:* user
> *Subject:* Spark on Yarn, is it possible to manually blacklist nodes
> before running spark job?
>
> Hi, is there any possibility to tell Scheduler to blacklist specific nodes
> in advance?
>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-18 Thread Felix Cheung
Not as far as I recall...



From: Serega Sheypak 
Sent: Friday, January 18, 2019 3:21 PM
To: user
Subject: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

Hi, is there any possibility to tell Scheduler to blacklist specific nodes in 
advance?


Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-18 Thread Serega Sheypak
Hi, is there any possibility to tell the scheduler to blacklist specific nodes
in advance?


Re: Spark on YARN not utilizing all the YARN containers available

2018-10-10 Thread Gourav Sengupta
Hi Dillon,

yes, we can understand the number of executors that are running, but the
question is more about understanding the relation between YARN containers,
their persistence, and Spark executors.

Regards,
Gourav

On Wed, Oct 10, 2018 at 6:38 AM Dillon Dukek 
wrote:

> There is documentation here
> http://spark.apache.org/docs/latest/running-on-yarn.html about running
> spark on YARN. Like I said before you can use either the logs from the
> application or the Spark UI to understand how many executors are running at
> any given time. I don't think I can help much further without more
> information about the specific use case.
>
>
> On Tue, Oct 9, 2018 at 2:54 PM Gourav Sengupta 
> wrote:
>
>> Hi Dillon,
>>
>> I do think that there is a setting available where in once YARN sets up
>> the containers then you do not deallocate them, I had used it previously in
>> HIVE, and it just saves processing time in terms of allocating containers.
>> That said I am still trying to understand how do we determine one YARN
>> container = one executor in SPARK.
>>
>> Regards,
>> Gourav
>>
>> On Tue, Oct 9, 2018 at 9:04 PM Dillon Dukek
>>  wrote:
>>
>>> I'm still not sure exactly what you are meaning by saying that you have
>>> 6 yarn containers. Yarn should just be aware of the total available
>>> resources in  your cluster and then be able to launch containers based on
>>> the executor requirements you set when you submit your job. If you can, I
>>> think it would be helpful to send me the command you're using to launch
>>> your spark process. You should also be able to use the logs and/or the
>>> spark UI to determine how many executors are running.
>>>
>>> On Tue, Oct 9, 2018 at 12:57 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> hi,
>>>>
>>>> may be I am not quite clear in my head on this one. But how do we know
>>>> that 1 yarn container = 1 executor?
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek
>>>>  wrote:
>>>>
>>>>> Can you send how you are launching your streaming process? Also what
>>>>> environment is this cluster running in (EMR, GCP, self managed, etc)?
>>>>>
>>>>> On Tue, Oct 9, 2018 at 10:21 AM kant kodali 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am using Spark 2.3.1 and using YARN as a cluster manager.
>>>>>>
>>>>>> I currently got
>>>>>>
>>>>>> 1) 6 YARN containers(executors=6) with 4 executor cores for each
>>>>>> container.
>>>>>> 2) 6 Kafka partitions from one topic.
>>>>>> 3) You can assume every other configuration is set to whatever the
>>>>>> default values are.
>>>>>>
>>>>>> Spawned a Simple Streaming Query and I see all the tasks get
>>>>>> scheduled on one YARN container. am I missing any config?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>


Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
There is documentation here
http://spark.apache.org/docs/latest/running-on-yarn.html about running
Spark on YARN. Like I said before, you can use either the logs from the
application or the Spark UI to understand how many executors are running at
any given time. I don't think I can help much further without more
information about the specific use case.


On Tue, Oct 9, 2018 at 2:54 PM Gourav Sengupta 
wrote:

> Hi Dillon,
>
> I do think that there is a setting available where in once YARN sets up
> the containers then you do not deallocate them, I had used it previously in
> HIVE, and it just saves processing time in terms of allocating containers.
> That said I am still trying to understand how do we determine one YARN
> container = one executor in SPARK.
>
> Regards,
> Gourav
>
> On Tue, Oct 9, 2018 at 9:04 PM Dillon Dukek
>  wrote:
>
>> I'm still not sure exactly what you are meaning by saying that you have 6
>> yarn containers. Yarn should just be aware of the total available resources
>> in  your cluster and then be able to launch containers based on the
>> executor requirements you set when you submit your job. If you can, I think
>> it would be helpful to send me the command you're using to launch your
>> spark process. You should also be able to use the logs and/or the spark UI
>> to determine how many executors are running.
>>
>> On Tue, Oct 9, 2018 at 12:57 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> hi,
>>>
>>> may be I am not quite clear in my head on this one. But how do we know
>>> that 1 yarn container = 1 executor?
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek
>>>  wrote:
>>>
>>>> Can you send how you are launching your streaming process? Also what
>>>> environment is this cluster running in (EMR, GCP, self managed, etc)?
>>>>
>>>> On Tue, Oct 9, 2018 at 10:21 AM kant kodali  wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am using Spark 2.3.1 and using YARN as a cluster manager.
>>>>>
>>>>> I currently got
>>>>>
>>>>> 1) 6 YARN containers(executors=6) with 4 executor cores for each
>>>>> container.
>>>>> 2) 6 Kafka partitions from one topic.
>>>>> 3) You can assume every other configuration is set to whatever the
>>>>> default values are.
>>>>>
>>>>> Spawned a Simple Streaming Query and I see all the tasks get scheduled
>>>>> on one YARN container. am I missing any config?
>>>>>
>>>>> Thanks!
>>>>>
>>>>


Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Gourav Sengupta
Hi Dillon,

I do think that there is a setting available wherein, once YARN sets up the
containers, you do not deallocate them; I had used it previously in Hive, and
it just saves processing time in terms of allocating containers.
That said, I am still trying to understand how we determine that one YARN
container = one executor in Spark.
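
For reference, on YARN each Spark executor runs inside its own YARN container,
so the container count follows from the executor settings at submission time; a
sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

// Six executors -> six executor containers (plus one container for the AM).
val conf = new SparkConf()
  .set("spark.executor.instances", "6")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "8g")
```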

Regards,
Gourav

On Tue, Oct 9, 2018 at 9:04 PM Dillon Dukek 
wrote:

> I'm still not sure exactly what you are meaning by saying that you have 6
> yarn containers. Yarn should just be aware of the total available resources
> in  your cluster and then be able to launch containers based on the
> executor requirements you set when you submit your job. If you can, I think
> it would be helpful to send me the command you're using to launch your
> spark process. You should also be able to use the logs and/or the spark UI
> to determine how many executors are running.
>
> On Tue, Oct 9, 2018 at 12:57 PM Gourav Sengupta 
> wrote:
>
>> hi,
>>
>> may be I am not quite clear in my head on this one. But how do we know
>> that 1 yarn container = 1 executor?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek
>>  wrote:
>>
>>> Can you send how you are launching your streaming process? Also what
>>> environment is this cluster running in (EMR, GCP, self managed, etc)?
>>>
>>> On Tue, Oct 9, 2018 at 10:21 AM kant kodali  wrote:
>>>
 Hi All,

 I am using Spark 2.3.1 and using YARN as a cluster manager.

 I currently got

 1) 6 YARN containers(executors=6) with 4 executor cores for each
 container.
 2) 6 Kafka partitions from one topic.
 3) You can assume every other configuration is set to whatever the
 default values are.

 Spawned a Simple Streaming Query and I see all the tasks get scheduled
 on one YARN container. am I missing any config?

 Thanks!

>>>


Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
I'm still not sure exactly what you mean by saying that you have 6
yarn containers. Yarn should just be aware of the total available resources
in your cluster and then be able to launch containers based on the
executor requirements you set when you submit your job. If you can, I think
it would be helpful to send me the command you're using to launch your
spark process. You should also be able to use the logs and/or the spark UI
to determine how many executors are running.
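
A quick way to check this from inside the application is the public status
tracker API (a sketch assuming a SparkSession named `spark`; the returned list
also includes an entry for the driver):

```scala
// Log the executors the driver currently knows about and their running tasks.
val infos = spark.sparkContext.statusTracker.getExecutorInfos
infos.foreach(e => println(s"${e.host()}:${e.port()} runningTasks=${e.numRunningTasks()}"))
println(s"entries (including the driver): ${infos.length}")
```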

On Tue, Oct 9, 2018 at 12:57 PM Gourav Sengupta 
wrote:

> hi,
>
> may be I am not quite clear in my head on this one. But how do we know
> that 1 yarn container = 1 executor?
>
> Regards,
> Gourav Sengupta
>
> On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek
>  wrote:
>
>> Can you send how you are launching your streaming process? Also what
>> environment is this cluster running in (EMR, GCP, self managed, etc)?
>>
>> On Tue, Oct 9, 2018 at 10:21 AM kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> I am using Spark 2.3.1 and using YARN as a cluster manager.
>>>
>>> I currently got
>>>
>>> 1) 6 YARN containers(executors=6) with 4 executor cores for each
>>> container.
>>> 2) 6 Kafka partitions from one topic.
>>> 3) You can assume every other configuration is set to whatever the
>>> default values are.
>>>
>>> Spawned a Simple Streaming Query and I see all the tasks get scheduled
>>> on one YARN container. am I missing any config?
>>>
>>> Thanks!
>>>
>>


Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Gourav Sengupta
hi,

Maybe I am not quite clear in my head on this one, but how do we know that
1 YARN container = 1 executor?

Regards,
Gourav Sengupta

On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek 
wrote:

> Can you send how you are launching your streaming process? Also what
> environment is this cluster running in (EMR, GCP, self managed, etc)?
>
> On Tue, Oct 9, 2018 at 10:21 AM kant kodali  wrote:
>
>> Hi All,
>>
>> I am using Spark 2.3.1 and using YARN as a cluster manager.
>>
>> I currently got
>>
>> 1) 6 YARN containers(executors=6) with 4 executor cores for each
>> container.
>> 2) 6 Kafka partitions from one topic.
>> 3) You can assume every other configuration is set to whatever the
>> default values are.
>>
>> Spawned a Simple Streaming Query and I see all the tasks get scheduled on
>> one YARN container. am I missing any config?
>>
>> Thanks!
>>
>


Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
Can you send how you are launching your streaming process? Also what
environment is this cluster running in (EMR, GCP, self managed, etc)?

On Tue, Oct 9, 2018 at 10:21 AM kant kodali  wrote:

> Hi All,
>
> I am using Spark 2.3.1 and using YARN as a cluster manager.
>
> I currently got
>
> 1) 6 YARN containers(executors=6) with 4 executor cores for each
> container.
> 2) 6 Kafka partitions from one topic.
> 3) You can assume every other configuration is set to whatever the default
> values are.
>
> Spawned a Simple Streaming Query and I see all the tasks get scheduled on
> one YARN container. am I missing any config?
>
> Thanks!
>


Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread kant kodali
Hi All,

I am using Spark 2.3.1 and using YARN as a cluster manager.

I currently got

1) 6 YARN containers(executors=6) with 4 executor cores for each container.
2) 6 Kafka partitions from one topic.
3) You can assume every other configuration is set to whatever the default
values are.

I spawned a simple streaming query and I see all the tasks get scheduled on
one YARN container. Am I missing any config?

Thanks!


Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to have less than 1 core for the AM; this is due
to YARN, not Spark.

The number of AMs compared to the number of executors should be small and
acceptable. If you do want to save more resources, I would suggest using
yarn-cluster mode, where the driver and the AM run in the same process.

You can use either Livy or Zeppelin, both of which support interactive work in
yarn-cluster mode.

http://livy.incubator.apache.org/
https://zeppelin.apache.org/
https://medium.com/@zjffdu/zeppelin-0-8-0-new-features-ea53e8810235


Another approach to save resources is to share a SparkContext across your
applications, since your scenario is interactive work (I guess it is some
kind of notebook). Zeppelin supports sharing a SparkContext across users and
notes.



peay <p...@protonmail.com> wrote on Fri, 18 May 2018 at 6:20 PM:

> Hello,
>
> I run a Spark cluster on YARN, and we have a bunch of client-mode
> applications we use for interactive work. Whenever we start one of this, an
> application master container is started.
>
> My understanding is that this is mostly an empty shell, used to request
> further containers or get status from YARN. Is that correct?
>
> spark.yarn.am.cores is 1, and that AM gets one full vCore on the cluster.
> Because I am using DominantResourceCalculator to take vCores into account
> for scheduling, this results in a lot of unused CPU capacity overall
> because all those AMs each block one full vCore. With enough jobs, this
> adds up quickly.
>
> I am trying to understand if we can work around that -- ideally, by
> allocating fractional vCores (e.g., give 100 millicores to the AM), or by
> allocating no vCores at all for the AM (I am fine with a bit of
> oversubscription because of that).
>
> Any idea on how to avoid blocking so many YARN vCores just for the Spark
> AMs?
>
> Thanks!
>
>


Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-18 Thread peay
Hello,

I run a Spark cluster on YARN, and we have a bunch of client-mode applications 
we use for interactive work. Whenever we start one of these, an application 
master container is started.

My understanding is that this is mostly an empty shell, used to request further 
containers or get status from YARN. Is that correct?

spark.yarn.am.cores is 1, and that AM gets one full vCore on the cluster. 
Because I am using DominantResourceCalculator to take vCores into account for 
scheduling, this results in a lot of unused CPU capacity overall because all 
those AMs each block one full vCore. With enough jobs, this adds up quickly.

I am trying to understand if we can work around that -- ideally, by allocating 
fractional vCores (e.g., give 100 millicores to the AM), or by allocating no 
vCores at all for the AM (I am fine with a bit of oversubscription because of 
that).

Any idea on how to avoid blocking so many YARN vCores just for the Spark AMs?

Thanks!

Hortonworks Spark-Hbase-Connector does not read zookeeper configurations from spark session config ??(Spark on Yarn)

2018-02-22 Thread Dharmin Siddesh J
Hi

I am trying to write Spark code that reads data from HBase and stores it
in a DataFrame.
I am able to run it perfectly with hbase-site.xml in the $spark-home/conf
folder.
But I am facing a few issues here.

Issue 1: Passing the hbase-site.xml location with the --files parameter when
submitting in client mode (it works in cluster mode).

When I removed hbase-site.xml from spark/conf and tried to execute the job in
client mode by passing the file with the --files parameter over YARN, I kept
getting the following exception, which I think means it is not taking the
ZooKeeper configuration from hbase-site.xml. However, it works fine when
I run it in cluster mode.
Sample command:

spark-submit --master yarn --deploy-mode cluster --files
/home/siddesh/hbase-site.xml --class com.orzota.rs.json.HbaseConnector
 --packages com.hortonworks:shc:1.0.0-2.0-s_2.11 --repositories
http://repo.hortonworks.com/content/groups/public/
target/scala-2.11/test-0.1-SNAPSHOT.jar

at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
18/02/22 01:43:09 INFO ClientCnxn: Opening socket connection to server
localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL
(unknown error)
18/02/22 01:43:09 WARN ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)

Issue 2: Passing HBase configuration details through the Spark session (not
working in cluster mode or in client mode).
Instead of passing the entire hbase-site.xml, I am trying to add the
configuration directly in the Spark code by setting it as a config parameter
on the Spark session. The following is a sample SparkSession snippet:

val spark = SparkSession
    .builder()
    .appName(name)
    .config("hbase.zookeeper.property.clientPort", "2181")
    .config("hbase.zookeeper.quorum", "ip1,ip2,ip3")
    .config("spark.hbase.host", "zookeeperquorum")
    .getOrCreate()

val json_df = spark.read
    .option("catalog", catalog_read)
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()

But this is not working in cluster mode, while issue 1 continues in
client mode.

Can anyone help me with a solution, or an explanation of why this is happening?
Are there any workarounds?

regards
Sid


How to create security filter for Spark UI in Spark on YARN

2018-01-09 Thread Jhon Anderson Cardenas Diaz
*Environment*:
AWS EMR, yarn cluster.

*Description*:
I am trying to use a Java filter to protect access to the Spark UI, by
using the property spark.ui.filters; the problem is that when Spark is
running in yarn mode, that property is always overridden by Hadoop
with the filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter:

*spark.ui.filters:
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter*

And these properties are automatically added:


*spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS:
ip-x-x-x-226.eu-west-1.compute.internalspark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES:
http://ip-x-x-x-226.eu-west-1.compute.internal:20888/proxy/application_x_
*

Any suggestion on how to add a Java security filter so Hadoop does not
override it, or maybe how to configure the security from the Hadoop side?

Thanks.
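
For reference, a bare-bones sketch of the kind of servlet filter that
spark.ui.filters expects (class and package names are made up here, and this by
itself does not address the AmIpFilter override described above):

```scala
package com.example.filters // hypothetical package

import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}

// Referenced as: spark.ui.filters=com.example.filters.BasicSecurityFilter
class BasicSecurityFilter extends Filter {
  override def init(filterConfig: FilterConfig): Unit = ()

  override def doFilter(request: ServletRequest, response: ServletResponse,
                        chain: FilterChain): Unit = {
    // Authentication / authorization checks would go here before the request
    // is passed further down the chain.
    chain.doFilter(request, response)
  }

  override def destroy(): Unit = ()
}
```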


Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-04 Thread Marcelo Vanzin
On Wed, Jan 3, 2018 at 8:18 PM, John Zhuge <john.zh...@gmail.com> wrote:
> Something like:
>
> Note: When running Spark on YARN, environment variables for the executors
> need to be set using the spark.yarn.executorEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file or on the command line.
> Environment variables that are set in spark-env.sh will not be reflected in
> the executor process.

I'm not against adding docs, but that's probably true for all
backends. No backend I know sources spark-env.sh before starting
executors.

For example, the standalone worker sources spark-env.sh before
starting the daemon, and those env variables "leak" to the executors.
But you can't customize an individual executor's environment that way
without restarting the service.
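
A minimal sketch of the property-based alternative (FOO=bar is an illustrative
variable; the property prefixes are the documented ones):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.appMasterEnv.FOO", "bar") // environment for the YARN AM (the driver in cluster mode)
  .set("spark.executorEnv.FOO", "bar")       // environment for the executor processes
```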

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread John Zhuge
Sounds good.

Should we add another paragraph after this paragraph in configuration.md to
explain executor env as well? I will be happy to upload a simple patch.

Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
>  property in your conf/spark-defaults.conf file. Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in clustermode. See the YARN-related Spark
> Properties
> <https://github.com/apache/spark/blob/master/docs/running-on-yarn.html#spark-properties>
>  for
> more information.


Something like:

Note: When running Spark on YARN, environment variables for the executors
need to be set using the spark.yarn.executorEnv.[EnvironmentVariableName]
property in your conf/spark-defaults.conf file or on the command line.
Environment variables that are set in spark-env.sh will not be reflected in
the executor process.



On Wed, Jan 3, 2018 at 7:53 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Because spark-env.sh is something that makes sense only on the gateway
> machine (where the app is being submitted from).
>
> On Wed, Jan 3, 2018 at 6:46 PM, John Zhuge <john.zh...@gmail.com> wrote:
> > Thanks Jacek and Marcelo!
> >
> > Any reason it is not sourced? Any security consideration?
> >
> >
> > On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>
> >> On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge <jzh...@apache.org> wrote:
> >> > I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster.
> Is
> >> > spark-env.sh sourced when starting the Spark AM container or the
> >> > executor
> >> > container?
> >>
> >> No, it's not.
> >>
> >> --
> >> Marcelo
> >
> >
> >
> >
> > --
> > John
>
>
>
> --
> Marcelo
>



-- 
John


Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread Marcelo Vanzin
Because spark-env.sh is something that makes sense only on the gateway
machine (where the app is being submitted from).

On Wed, Jan 3, 2018 at 6:46 PM, John Zhuge  wrote:
> Thanks Jacek and Marcelo!
>
> Any reason it is not sourced? Any security consideration?
>
>
> On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin  wrote:
>>
>> On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge  wrote:
>> > I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
>> > spark-env.sh sourced when starting the Spark AM container or the
>> > executor
>> > container?
>>
>> No, it's not.
>>
>> --
>> Marcelo
>
>
>
>
> --
> John



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread John Zhuge
Thanks Jacek and Marcelo!

Any reason it is not sourced? Any security consideration?


On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin  wrote:

> On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge  wrote:
> > I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
> > spark-env.sh sourced when starting the Spark AM container or the executor
> > container?
>
> No, it's not.
>
> --
> Marcelo
>



-- 
John


Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread Marcelo Vanzin
On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge  wrote:
> I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
> spark-env.sh sourced when starting the Spark AM container or the executor
> container?

No, it's not.

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread Jacek Laskowski
Hi,

My understanding is that AM with the driver (in cluster deploy mode) and
executors are simple Java processes with their settings set one by one
while submitting a Spark application for execution and creating
ContainerLaunchContext for launching YARN containers. See
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala?utf8=%E2%9C%93#L796-L801
for the code that does the settings to properties mapping.

With that I think conf/spark-defaults.conf won't be loaded by itself.

Why don't you set a property and see if it's available on the driver in
cluster deploy mode? That should give you a definitive answer (or at least
get you closer).

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, Jan 3, 2018 at 7:57 AM, John Zhuge <jzh...@apache.org> wrote:

> Hi,
>
> I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
> spark-env.sh sourced when starting the Spark AM container or the executor
> container?
>
> Saw this paragraph on https://github.com/apache/spark/blob/master/docs/
> configuration.md:
>
> Note: When running Spark on YARN in cluster mode, environment variables
>> need to be set using the spark.yarn.appMasterEnv.[
>> EnvironmentVariableName] property in your conf/spark-defaults.conf file.
>> Environment variables that are set in spark-env.sh will not be reflected
>> in the YARN Application Master process in clustermode. See the YARN-related
>> Spark Properties
>> <https://github.com/apache/spark/blob/master/docs/running-on-yarn.html#spark-properties>
>>  for
>> more information.
>
>
> Does it mean spark-env.sh will not be sourced when starting AM in cluster
> mode?
> Does this paragraph appy to executor as well?
>
> Thanks,
> --
> John Zhuge
>


Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-02 Thread John Zhuge
Hi,

I am running Spark 2.0.0 and 2.1.1 on YARN in a Hadoop 2.7.3 cluster. Is
spark-env.sh sourced when starting the Spark AM container or the executor
container?

Saw this paragraph on
https://github.com/apache/spark/blob/master/docs/configuration.md:

Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] 
> property
> in your conf/spark-defaults.conf file. Environment variables that are set
> in spark-env.sh will not be reflected in the YARN Application Master
> process in clustermode. See the YARN-related Spark Properties
> <https://github.com/apache/spark/blob/master/docs/running-on-yarn.html#spark-properties>
>  for
> more information.


Does it mean spark-env.sh will not be sourced when starting AM in cluster
mode?
Does this paragraph appy to executor as well?

Thanks,
-- 
John Zhuge


[Spark on YARN] Asynchronously launching containers in YARN

2017-10-13 Thread Craig Ingram
I was recently doing some research into Spark on YARN's startup time and
observed slow, synchronous allocation of containers/executors. I am testing
on a 4 node bare metal cluster w/48 cores and 128GB memory per node. YARN
was only allocating about 3 containers per second. Moreover when starting 3
Spark applications at the same time with each requesting 44 containers, the
first application would get all 44 requested containers and then the next
application would start getting containers and so on.

From looking at the code, it appears this is by design. There is an
undocumented configuration variable that will enable asynchronous allocation
of containers. I'm sure I'm missing something, but why is this not the
default? Is there a bug or race condition in this code path? I've done some
testing with it and it's been working and is significantly faster.

Here's the config:
`yarn.scheduler.capacity.schedule-asynchronously.enable`

I created a JIRA ticket in YARN's project, but I am curious whether anyone else
has experienced similar issues or has tested this configuration extensively.

YARN-7327   



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Satoshi Yamada
Jerry,

Thanks for your comment.

On Mon, Sep 4, 2017 at 10:43 AM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> I think spark.yarn.am.port is not used any more, so you don't need to
> consider this.
>
> If you're running Spark on YARN, I think some YARN RM port to submit
> applications should also be reachable via firewall, as well as HDFS port to
> upload resources.
>
> Also in the Spark side, executors will be connected to driver via
> spark.driver.port, maybe you should also set a fixed port number for this
> and add to white list of firewall.
>
> Thanks
> Jerry
>
>
> On Mon, Sep 4, 2017 at 8:50 AM, Satoshi Yamada <
> satoshi.yamada....@gmail.com> wrote:
>
>> Hi,
>>
>> In case we run Spark on Yarn in client mode, we have firewall for Hadoop 
>> cluster,
>> and the client node is outside firewall, I think I have to open some ports
>> that Application Master uses.
>>
>>
>> I think the ports is specified by "spark.yarn.am.port" as document says.
>> https://spark.apache.org/docs/latest/running-on-yarn.html
>>
>> But, according to the source code, spark.yarn.am.port is deprecated since 
>> 2.0.
>> https://github.com/apache/spark/commit/829cd7b8b70e65a91aa66e6d626bd45f18e0ad97
>>
>> Does this mean we do not need to open particular ports of firewall for
>>
>> Spark on Yarn?
>>
>>
>> Thanks,
>>
>>
>


Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Saisai Shao
I think spark.yarn.am.port is not used any more, so you don't need to
consider this.

If you're running Spark on YARN, I think the YARN ResourceManager port used to
submit applications should also be reachable through the firewall, as well as
the HDFS port used to upload resources.

Also, on the Spark side, executors will connect to the driver via
spark.driver.port, so you may want to set a fixed port number for this as well
and add it to the firewall whitelist.
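
To make that concrete, a sketch of pinning the driver-side ports so they can be
whitelisted (the port numbers, class and jar names below are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40001 \
  --class com.example.MyApp \
  my-app.jar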

Thanks
Jerry


On Mon, Sep 4, 2017 at 8:50 AM, Satoshi Yamada <satoshi.yamada@gmail.com
> wrote:

> Hi,
>
> In case we run Spark on Yarn in client mode, we have firewall for Hadoop 
> cluster,
> and the client node is outside firewall, I think I have to open some ports
> that Application Master uses.
>
>
> I think the ports is specified by "spark.yarn.am.port" as document says.
> https://spark.apache.org/docs/latest/running-on-yarn.html
>
> But, according to the source code, spark.yarn.am.port is deprecated since 2.0.
> https://github.com/apache/spark/commit/829cd7b8b70e65a91aa66e6d626bd45f18e0ad97
>
> Does this mean we do not need to open particular ports of firewall for
>
> Spark on Yarn?
>
>
> Thanks,
>
>


Port to open for submitting Spark on Yarn application

2017-09-03 Thread Satoshi Yamada
Hi,

In case we run Spark on Yarn in client mode, where we have a firewall for the
Hadoop cluster and the client node is outside the firewall, I think I have to
open some ports that the Application Master uses.


I think the port is specified by "spark.yarn.am.port", as the documentation says.
https://spark.apache.org/docs/latest/running-on-yarn.html

But, according to the source code, spark.yarn.am.port has been deprecated since 2.0.
https://github.com/apache/spark/commit/829cd7b8b70e65a91aa66e6d626bd45f18e0ad97

Does this mean we do not need to open particular firewall ports for
Spark on Yarn?


Thanks,


Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
For yarn, I'm speaking about the file fairscheduler.xml (if you kept the 
default scheduling of Yarn): 
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format
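
For illustration, a minimal sketch of such an allocation file (the queue name
and limits are placeholders, following the format in the page above):

<?xml version="1.0"?>
<allocations>
  <queue name="analytics">
    <minResources>10000 mb,10vcores</minResources>
    <maxResources>90000 mb,40vcores</maxResources>
    <maxRunningApps>20</maxRunningApps>
    <weight>1.0</weight>
  </queue>
</allocations>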


Yohann Jardin

Le 7/28/2017 à 8:00 PM, jeff saremi a écrit :

The only relevant setting i see in Yarn is this:

  
yarn.nodemanager.resource.memory-mb
120726
  
which is 120GB and we are well below that. I don't see a total limit.

I haven't played with spark.memory.fraction. I'm not sure if it makes a 
difference. Note that there are no errors coming from Spark with respect to 
memory being an issue. Yarn kills the JVM and just prints out one line: Out of 
memory in the stdout of the container. After that Spark complains about the 
ExecutorLostFailure. So the memory factions are not playing a factor here.
I just looked at the link you included. Thank you. Yes this is the same problem 
however it looks like no one has come up with a solution for this problem yet



From: yohann jardin <yohannjar...@hotmail.com><mailto:yohannjar...@hotmail.com>
Sent: Friday, July 28, 2017 10:47:40 AM
To: jeff saremi; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: How to configure spark on Yarn cluster


Not sure that we are OK on one thing: Yarn limitations are for the sum of all 
nodes, while you only specify the memory for a single node through Spark.


By the way, the memory displayed in the UI is only a part of the total memory 
allocation: 
https://spark.apache.org/docs/latest/configuration.html#memory-management

It corresponds to “spark.memory.fraction”, so it will mainly be filled by the 
rdd you’re trying to persist. The memory left by this parameter will be used to 
read the input file and compute. When the fail comes from this, the Out Of 
Memory exception is quite explicit in the driver logs.

Testing sampled files of 1 GB, 1 TB, 10 TB should help investigate what goes 
right and what goes wrong at least a bit.


Also, did you check for similar issues on stackoverflow? Like 
https://stackoverflow.com/questions/40781354/container-killed-by-yarn-for-exceeding-memory-limits-10-4-gb-of-10-4-gb-physic


Regards,

Yohann Jardin

Le 7/28/2017 à 6:05 PM, jeff saremi a écrit :

Thanks so much Yohann

I checked the Storage/Memory column in Executors status page. Well below where 
I wanted to be.
I will try the suggestion on smaller data sets.
I am also well within the Yarn limitations (128GB). In my last try I asked for 
48+32 (overhead). So somehow I am exceeding that or I should say Spark is 
exceeding since I am trusting to manage the memory I provided for it.
Is there anything in Shuffle Write Size, Shuffle Spill, or anything in the logs 
that I should be looking for to come up with the recommended memory size or 
partition count?

thanks


From: yohann jardin <yohannjar...@hotmail.com><mailto:yohannjar...@hotmail.com>
Sent: Thursday, July 27, 2017 11:15:39 PM
To: jeff saremi; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: How to configure spark on Yarn cluster


Check the executor page of the Spark UI, to check if your storage level is 
limiting.


Also, instead of starting with 100 TB of data, sample it, make it work, and 
grow it little by little until you reached 100 TB. This will validate the 
workflow and let you see how much data is shuffled, etc.


And just in case, check the limits you set on your Yarn queue. If you try to 
allocate more memory to your job than what is set on the queue, there might be 
cases of failure.
Though there are some limitations, it’s possible to allocate more ram to your 
job than available on your Yarn queue.


Yohann Jardin

Le 7/28/2017 à 8:03 AM, jeff saremi a écrit :

I have the simplest job which i'm running against 100TB of data. The job keeps 
failing with ExecutorLostFailure's on containers killed by Yarn for exceeding 
memory limits

I have varied the executor-memory from 32GB to 96GB, the 
spark.yarn.executor.memoryOverhead from 8192 to 36000 and similar changes to 
the number of cores, and driver size. It looks like nothing stops this error 
(running out of memory) from happening. Looking at metrics reported by Spark 
status page, is there anything I can use to configure my job properly? Is 
repartitioning more or less going to help at all? The current number of 
partitions is around 40,000 currently.

Here's the gist of the code:


val input = sc.textFile(path)

val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))

t0.persist(StorageLevel.DISK_ONLY)


I have changed storagelevel from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY to 
no avail.







Re: How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
The only relevant setting I see in Yarn is this:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>120726</value>
  </property>

which is 120GB, and we are well below that. I don't see a total limit.

I haven't played with spark.memory.fraction. I'm not sure if it makes a 
difference. Note that there are no errors coming from Spark with respect to 
memory being an issue. Yarn kills the JVM and just prints one line, "Out of 
memory", in the stdout of the container. After that Spark complains about the 
ExecutorLostFailure. So the memory fractions are not playing a factor here.
I just looked at the link you included. Thank you. Yes, this is the same problem; 
however, it looks like no one has come up with a solution for it yet.



From: yohann jardin <yohannjar...@hotmail.com>
Sent: Friday, July 28, 2017 10:47:40 AM
To: jeff saremi; user@spark.apache.org
Subject: Re: How to configure spark on Yarn cluster


Not sure that we are OK on one thing: Yarn limitations are for the sum of all 
nodes, while you only specify the memory for a single node through Spark.


By the way, the memory displayed in the UI is only a part of the total memory 
allocation: 
https://spark.apache.org/docs/latest/configuration.html#memory-management

It corresponds to “spark.memory.fraction”, so it will mainly be filled by the 
rdd you’re trying to persist. The memory left by this parameter will be used to 
read the input file and compute. When the fail comes from this, the Out Of 
Memory exception is quite explicit in the driver logs.

Testing sampled files of 1 GB, 1 TB, 10 TB should help investigate what goes 
right and what goes wrong at least a bit.


Also, did you check for similar issues on stackoverflow? Like 
https://stackoverflow.com/questions/40781354/container-killed-by-yarn-for-exceeding-memory-limits-10-4-gb-of-10-4-gb-physic


Regards,

Yohann Jardin

Le 7/28/2017 à 6:05 PM, jeff saremi a écrit :

Thanks so much Yohann

I checked the Storage/Memory column in Executors status page. Well below where 
I wanted to be.
I will try the suggestion on smaller data sets.
I am also well within the Yarn limitations (128GB). In my last try I asked for 
48+32 (overhead). So somehow I am exceeding that or I should say Spark is 
exceeding since I am trusting to manage the memory I provided for it.
Is there anything in Shuffle Write Size, Shuffle Spill, or anything in the logs 
that I should be looking for to come up with the recommended memory size or 
partition count?

thanks


From: yohann jardin <yohannjar...@hotmail.com><mailto:yohannjar...@hotmail.com>
Sent: Thursday, July 27, 2017 11:15:39 PM
To: jeff saremi; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: How to configure spark on Yarn cluster


Check the executor page of the Spark UI, to check if your storage level is 
limiting.


Also, instead of starting with 100 TB of data, sample it, make it work, and 
grow it little by little until you reached 100 TB. This will validate the 
workflow and let you see how much data is shuffled, etc.


And just in case, check the limits you set on your Yarn queue. If you try to 
allocate more memory to your job than what is set on the queue, there might be 
cases of failure.
Though there are some limitations, it’s possible to allocate more ram to your 
job than available on your Yarn queue.


Yohann Jardin

Le 7/28/2017 à 8:03 AM, jeff saremi a écrit :

I have the simplest job which i'm running against 100TB of data. The job keeps 
failing with ExecutorLostFailure's on containers killed by Yarn for exceeding 
memory limits

I have varied the executor-memory from 32GB to 96GB, the 
spark.yarn.executor.memoryOverhead from 8192 to 36000 and similar changes to 
the number of cores, and driver size. It looks like nothing stops this error 
(running out of memory) from happening. Looking at metrics reported by Spark 
status page, is there anything I can use to configure my job properly? Is 
repartitioning more or less going to help at all? The current number of 
partitions is around 40,000 currently.

Here's the gist of the code:


val input = sc.textFile(path)

val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))

t0.persist(StorageLevel.DISK_ONLY)


I have changed storagelevel from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY to 
no avail.






Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
Not sure that we are OK on one thing: Yarn limitations are for the sum of all 
nodes, while you only specify the memory for a single node through Spark.


By the way, the memory displayed in the UI is only a part of the total memory 
allocation: 
https://spark.apache.org/docs/latest/configuration.html#memory-management

It corresponds to “spark.memory.fraction”, so it will mainly be filled by the 
RDD you’re trying to persist. The memory left by this parameter will be used to 
read the input file and compute. When the failure comes from this, the Out Of 
Memory exception is quite explicit in the driver logs.

Testing sampled files of 1 GB, 1 TB, 10 TB should help investigate what goes 
right and what goes wrong at least a bit.


Also, did you check for similar issues on stackoverflow? Like 
https://stackoverflow.com/questions/40781354/container-killed-by-yarn-for-exceeding-memory-limits-10-4-gb-of-10-4-gb-physic


Regards,

Yohann Jardin

Le 7/28/2017 à 6:05 PM, jeff saremi a écrit :

Thanks so much Yohann

I checked the Storage/Memory column in Executors status page. Well below where 
I wanted to be.
I will try the suggestion on smaller data sets.
I am also well within the Yarn limitations (128GB). In my last try I asked for 
48+32 (overhead). So somehow I am exceeding that or I should say Spark is 
exceeding since I am trusting to manage the memory I provided for it.
Is there anything in Shuffle Write Size, Shuffle Spill, or anything in the logs 
that I should be looking for to come up with the recommended memory size or 
partition count?

thanks


From: yohann jardin <yohannjar...@hotmail.com><mailto:yohannjar...@hotmail.com>
Sent: Thursday, July 27, 2017 11:15:39 PM
To: jeff saremi; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: How to configure spark on Yarn cluster


Check the executor page of the Spark UI, to check if your storage level is 
limiting.


Also, instead of starting with 100 TB of data, sample it, make it work, and 
grow it little by little until you reached 100 TB. This will validate the 
workflow and let you see how much data is shuffled, etc.


And just in case, check the limits you set on your Yarn queue. If you try to 
allocate more memory to your job than what is set on the queue, there might be 
cases of failure.
Though there are some limitations, it’s possible to allocate more ram to your 
job than available on your Yarn queue.


Yohann Jardin

Le 7/28/2017 à 8:03 AM, jeff saremi a écrit :

I have the simplest job which i'm running against 100TB of data. The job keeps 
failing with ExecutorLostFailure's on containers killed by Yarn for exceeding 
memory limits

I have varied the executor-memory from 32GB to 96GB, the 
spark.yarn.executor.memoryOverhead from 8192 to 36000 and similar changes to 
the number of cores, and driver size. It looks like nothing stops this error 
(running out of memory) from happening. Looking at metrics reported by Spark 
status page, is there anything I can use to configure my job properly? Is 
repartitioning more or less going to help at all? The current number of 
partitions is around 40,000 currently.

Here's the gist of the code:


val input = sc.textFile(path)

val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))

t0.persist(StorageLevel.DISK_ONLY)


I have changed storagelevel from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY to 
no avail.






Re: How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
Thanks so much Yohann

I checked the Storage/Memory column on the Executors status page. It is well below 
where I wanted to be.
I will try the suggestion on smaller data sets.
I am also well within the Yarn limitations (128GB). In my last try I asked for 
48+32 (overhead). So somehow I am exceeding that, or I should say Spark is 
exceeding it, since I am trusting it to manage the memory I provided.
Is there anything in Shuffle Write Size, Shuffle Spill, or anything in the logs 
that I should be looking at to come up with the recommended memory size or 
partition count?

thanks


From: yohann jardin <yohannjar...@hotmail.com>
Sent: Thursday, July 27, 2017 11:15:39 PM
To: jeff saremi; user@spark.apache.org
Subject: Re: How to configure spark on Yarn cluster


Check the executor page of the Spark UI, to check if your storage level is 
limiting.


Also, instead of starting with 100 TB of data, sample it, make it work, and 
grow it little by little until you reached 100 TB. This will validate the 
workflow and let you see how much data is shuffled, etc.


And just in case, check the limits you set on your Yarn queue. If you try to 
allocate more memory to your job than what is set on the queue, there might be 
cases of failure.
Though there are some limitations, it’s possible to allocate more ram to your 
job than available on your Yarn queue.


Yohann Jardin

Le 7/28/2017 à 8:03 AM, jeff saremi a écrit :

I have the simplest job which i'm running against 100TB of data. The job keeps 
failing with ExecutorLostFailure's on containers killed by Yarn for exceeding 
memory limits

I have varied the executor-memory from 32GB to 96GB, the 
spark.yarn.executor.memoryOverhead from 8192 to 36000 and similar changes to 
the number of cores, and driver size. It looks like nothing stops this error 
(running out of memory) from happening. Looking at metrics reported by Spark 
status page, is there anything I can use to configure my job properly? Is 
repartitioning more or less going to help at all? The current number of 
partitions is around 40,000 currently.

Here's the gist of the code:


val input = sc.textFile(path)

val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))

t0.persist(StorageLevel.DISK_ONLY)


I have changed storagelevel from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY to 
no avail.





Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
Check the executor page of the Spark UI to see if your storage level is 
limiting.


Also, instead of starting with 100 TB of data, sample it, make it work, and 
grow it little by little until you reach 100 TB. This will validate the 
workflow and let you see how much data is shuffled, etc.


And just in case, check the limits you set on your Yarn queue. If you try to 
allocate more memory to your job than what is set on the queue, there might be 
cases of failure.
Though there are some limitations, it’s possible to allocate more RAM to your 
job than is available on your Yarn queue.


Yohann Jardin

Le 7/28/2017 à 8:03 AM, jeff saremi a écrit :

I have the simplest job which i'm running against 100TB of data. The job keeps 
failing with ExecutorLostFailure's on containers killed by Yarn for exceeding 
memory limits

I have varied the executor-memory from 32GB to 96GB, the 
spark.yarn.executor.memoryOverhead from 8192 to 36000 and similar changes to 
the number of cores, and driver size. It looks like nothing stops this error 
(running out of memory) from happening. Looking at metrics reported by Spark 
status page, is there anything I can use to configure my job properly? Is 
repartitioning more or less going to help at all? The current number of 
partitions is around 40,000 currently.

Here's the gist of the code:


val input = sc.textFile(path)

val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))

t0.persist(StorageLevel.DISK_ONLY)


I have changed storagelevel from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY to 
no avail.





How to configure spark on Yarn cluster

2017-07-28 Thread jeff saremi
I have the simplest job, which I'm running against 100 TB of data. The job keeps 
failing with ExecutorLostFailures on containers killed by Yarn for exceeding 
memory limits.

I have varied the executor-memory from 32GB to 96GB and the 
spark.yarn.executor.memoryOverhead from 8192 to 36000, with similar changes to 
the number of cores and the driver size. It looks like nothing stops this error 
(running out of memory) from happening. Looking at the metrics reported by the Spark 
status page, is there anything I can use to configure my job properly? Is 
repartitioning to more or fewer partitions going to help at all? The number of 
partitions is currently around 40,000.

Here's the gist of the code:


import org.apache.spark.storage.StorageLevel

val input = sc.textFile(path)

val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))

// persist to disk only, to keep the executor heap free
t0.persist(StorageLevel.DISK_ONLY)


I have changed the storage level from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY, 
to no avail.
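
For reference, a sketch of the kind of submit command these settings correspond
to, using the 48g + 32 GB overhead combination tried in this thread (the class
and jar names are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 48g \
  --conf spark.yarn.executor.memoryOverhead=32768 \
  --class com.example.SimpleJob \
  simple-job.jar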




Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
Hi Josh,


As you say, I also recognize the problem. I recall getting a warning when
specifying a huge data set.


We also adjust the partition size, but we do it through command-line options
rather than through default settings or in code.


Regards,

Takashi

2017-07-18 6:48 GMT+09:00 Josh Holbrook :
> I just ran into this issue! Small world.
>
> As far as I can tell, by default spark on EMR is completely untuned, but it
> comes with a flag that you can set to tell EMR to autotune spark. In your
> configuration.json file, you can add something like:
>
>   {
> "Classification": "spark",
> "Properties": {
>   "maximizeResourceAllocation": "true"
> }
>   },
>
> but keep in mind that, again as far as I can tell, the default parallelism
> with this config is merely twice the number of executor cores--so for a 10
> machine cluster w/ 3 active cores each, 60 partitions. This is pretty low,
> so you'll likely want to adjust this--I'm currently using the following
> because spark chokes on datasets that are bigger than about 2g per
> partition:
>
>   {
> "Classification": "spark-defaults",
> "Properties": {
>   "spark.default.parallelism": "1000"
> }
>   }
>
> Good luck, and I hope this is helpful!
>
> --Josh
>
>
> On Mon, Jul 17, 2017 at 4:59 PM, Takashi Sasaki 
> wrote:
>>
>> Hi Pascal,
>>
>> The error also occurred frequently in our project.
>>
>> As a solution, it was effective to specify the memory size directly
>> with spark-submit command.
>>
>> eg. spark-submit executor-memory 2g
>>
>>
>> Regards,
>>
>> Takashi
>>
>> > 2017-07-18 5:18 GMT+09:00 Pascal Stammer :
>> >> Hi,
>> >>
>> >> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
>> >> following error that kill my application:
>> >>
>> >> AM Container for appattempt_1500320286695_0001_01 exited with
>> >> exitCode:
>> >> -104
>> >> For more detailed output, check application tracking
>> >>
>> >> page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
>> >> click on links to logs of each attempt.
>> >> Diagnostics: Container
>> >> [pid=9216,containerID=container_1500320286695_0001_01_01] is
>> >> running
>> >> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
>> >> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
>> >>
>> >>
>> >> I already change spark.yarn.executor.memoryOverhead but the error still
>> >> occurs. Does anybody have a hint for me which parameter or
>> >> configuration I
>> >> have to adapt.
>> >>
>> >> Thank you very much.
>> >>
>> >> Regards,
>> >>
>> >> Pascal Stammer
>> >>
>> >>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Josh Holbrook
I just ran into this issue! Small world.

As far as I can tell, by default spark on EMR is completely untuned, but it
comes with a flag that you can set to tell EMR to autotune spark. In your
configuration.json file, you can add something like:

  {
"Classification": "spark",
"Properties": {
  "maximizeResourceAllocation": "true"
}
  },

but keep in mind that, again as far as I can tell, the default parallelism
with this config is merely twice the number of executor cores--so for a 10
machine cluster w/ 3 active cores each, 60 partitions. This is pretty low,
so you'll likely want to adjust this--I'm currently using the following
because spark chokes on datasets that are bigger than about 2g per
partition:

  {
"Classification": "spark-defaults",
"Properties": {
  "spark.default.parallelism": "1000"
}
  }

Good luck, and I hope this is helpful!

--Josh


On Mon, Jul 17, 2017 at 4:59 PM, Takashi Sasaki 
wrote:

> Hi Pascal,
>
> The error also occurred frequently in our project.
>
> As a solution, it was effective to specify the memory size directly
> with spark-submit command.
>
> eg. spark-submit executor-memory 2g
>
>
> Regards,
>
> Takashi
>
> > 2017-07-18 5:18 GMT+09:00 Pascal Stammer :
> >> Hi,
> >>
> >> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
> >> following error that kill my application:
> >>
> >> AM Container for appattempt_1500320286695_0001_01 exited with
> exitCode:
> >> -104
> >> For more detailed output, check application tracking
> >> page:http://ip-172-31-35-192.eu-central-1.compute.internal:
> 8088/cluster/app/application_1500320286695_0001Then,
> >> click on links to logs of each attempt.
> >> Diagnostics: Container
> >> [pid=9216,containerID=container_1500320286695_0001_01_01] is
> running
> >> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
> >> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
> >>
> >>
> >> I already change spark.yarn.executor.memoryOverhead but the error still
> >> occurs. Does anybody have a hint for me which parameter or
> configuration I
> >> have to adapt.
> >>
> >> Thank you very much.
> >>
> >> Regards,
> >>
> >> Pascal Stammer
> >>
> >>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Pascal Stammer

Hi Takashi,

thanks for your help. After further investigation, I figured out that the 
killed container was the driver process. After setting 
spark.yarn.driver.memoryOverhead instead of spark.yarn.executor.memoryOverhead, 
the error was gone and the application now runs without error. Maybe it will 
help you as well.

Regards,

Pascal 




> Am 17.07.2017 um 22:59 schrieb Takashi Sasaki :
> 
> Hi Pascal,
> 
> The error also occurred frequently in our project.
> 
> As a solution, it was effective to specify the memory size directly
> with spark-submit command.
> 
> eg. spark-submit executor-memory 2g
> 
> 
> Regards,
> 
> Takashi
> 
>> 2017-07-18 5:18 GMT+09:00 Pascal Stammer :
>>> Hi,
>>> 
>>> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
>>> following error that kill my application:
>>> 
>>> AM Container for appattempt_1500320286695_0001_01 exited with exitCode:
>>> -104
>>> For more detailed output, check application tracking
>>> page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
>>> click on links to logs of each attempt.
>>> Diagnostics: Container
>>> [pid=9216,containerID=container_1500320286695_0001_01_01] is running
>>> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
>>> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
>>> 
>>> 
>>> I already change spark.yarn.executor.memoryOverhead but the error still
>>> occurs. Does anybody have a hint for me which parameter or configuration I
>>> have to adapt.
>>> 
>>> Thank you very much.
>>> 
>>> Regards,
>>> 
>>> Pascal Stammer
>>> 
>>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
Hi Pascal,

The error also occurred frequently in our project.

As a solution, it was effective to specify the memory size directly
with the spark-submit command.

e.g. spark-submit --executor-memory 2g


Regards,

Takashi

> 2017-07-18 5:18 GMT+09:00 Pascal Stammer :
>> Hi,
>>
>> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
>> following error that kill my application:
>>
>> AM Container for appattempt_1500320286695_0001_01 exited with exitCode:
>> -104
>> For more detailed output, check application tracking
>> page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
>> click on links to logs of each attempt.
>> Diagnostics: Container
>> [pid=9216,containerID=container_1500320286695_0001_01_01] is running
>> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
>> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
>>
>>
>> I already change spark.yarn.executor.memoryOverhead but the error still
>> occurs. Does anybody have a hint for me which parameter or configuration I
>> have to adapt.
>>
>> Thank you very much.
>>
>> Regards,
>>
>> Pascal Stammer
>>
>>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Running Spark und YARN on AWS EMR

2017-07-17 Thread Pascal Stammer
Hi,

I am running a Spark 2.1.x application on AWS EMR with YARN and get the following 
error that kills my application:

AM Container for appattempt_1500320286695_0001_01 exited with exitCode: -104
For more detailed output, check application tracking 
page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
 click on links to logs of each attempt.
Diagnostics: Container 
[pid=9216,containerID=container_1500320286695_0001_01_01] is running beyond 
physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 
3.3 GB of 6.9 GB virtual memory used. Killing container.


I have already changed spark.yarn.executor.memoryOverhead but the error still occurs. 
Does anybody have a hint as to which parameter or configuration I have to 
adapt?

Thank you very much.

Regards,

Pascal Stammer




Spark on yarn logging

2017-06-29 Thread John Vines
I followed the instructions for configuring a custom logger per
https://spark.apache.org/docs/2.0.2/running-on-yarn.html (because we have
long-running Spark jobs that occasionally get stuck and, without a
rolling file appender, will fill up the disk). This seems to work well for us,
but it breaks the web UI because it only has links for stderr/stdout.

I can take that URL and manually change it, but I'm wondering if there's a
way to configure the Spark UI to look for files of a specific format so
that no manual URL manipulation is necessary to view the logs.
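
For context, a minimal sketch of the kind of rolling-file log4j.properties being
described; the file name, size limit and backup count are illustrative, not the
actual setup:

log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n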

Thanks


spark on yarn cluster model can't use saveAsTable ?

2017-05-15 Thread lk_spark
hi, all:
I have a test under Spark 2.1.0 which reads txt files as a DataFrame and 
saves them to Hive. When I submit the app jar in yarn client mode it works well, 
but if I submit in cluster mode, it will not create the table or write data, 
and I didn't find any error log ... can anybody give me some clue?
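
For reference, a minimal Scala sketch of the kind of job being described
(reading text files into a DataFrame and writing to Hive); the paths, table
name and the explicit enableHiveSupport call are assumptions for illustration,
not the poster's actual code:

import org.apache.spark.sql.SparkSession

object SaveToHivePoc {
  def main(args: Array[String]): Unit = {
    // Hive support is required for saveAsTable to go through the Hive metastore
    val spark = SparkSession.builder()
      .appName("save-to-hive-poc")
      .enableHiveSupport()
      .getOrCreate()

    // Read the text files as a DataFrame (placeholder input path)
    val df = spark.read.text("hdfs:///data/input/")

    // In cluster mode the driver runs on the cluster, so the Hive configuration
    // (hive-site.xml) must be visible there for this write to reach the metastore
    df.write.mode("overwrite").saveAsTable("test_db.test_table")

    spark.stop()
  }
}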

2017-05-15


lk_spark 

Re: notebook connecting Spark On Yarn

2017-02-15 Thread Jon Gregg
Could you just make Hadoop's resource manager (port 8088) available to your
users, and they can check available containers that way if they see the
launch is stalling?

Another option is to reduce the default # of executors and memory per
executor in the launch script to some small fraction of your cluster size,
and make it so users can manually ask for more if they need to.  It doesn't
take a whole lot of workers/memory to build most of your spark code off a
sample.

Jon

On Wed, Feb 15, 2017 at 6:41 AM, Sachin Aggarwal <different.sac...@gmail.com
> wrote:

> Hi,
>
> I am trying to create multiple notebooks connecting to spark on yarn.
> After starting few jobs my cluster went out of containers. All new notebook
> request are in busy state as Jupyter kernel gateway is not getting any
> containers for master to be started.
>
> Some job are not leaving the containers for approx 10-15 mins. so user is
> not able to figure out what is wrong, why his kernel is still in busy state
>
> Is there any property or hack by which I can return valid response to
> users that there are no containers left.
>
> can I label/mark few containers for master equal to max kernel execution I
> am allowing in my cluster. so that if new kernel starts he will at least
> one container for master. it can be dynamic on priority based. if there is
> no container left then yarn can preempt some containers and provide them to
> new requests.
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
>


notebook connecting Spark On Yarn

2017-02-15 Thread Sachin Aggarwal
Hi,

I am trying to create multiple notebooks connecting to Spark on YARN. After
starting a few jobs my cluster ran out of containers. All new notebook
requests are in a busy state, as the Jupyter kernel gateway is not getting any
containers for the master to be started.

Some jobs are not releasing their containers for approx 10-15 mins, so the user is
not able to figure out what is wrong or why his kernel is still in a busy state.

Is there any property or hack by which I can return a valid response to users
that there are no containers left?

Can I label/mark a few containers for the master, equal to the max number of kernels I
am allowing in my cluster, so that if a new kernel starts it will get at least
one container for the master? It could be dynamic and priority-based: if there is
no container left then YARN can preempt some containers and provide them to
new requests.


-- 

Thanks & Regards

Sachin Aggarwal
7760502772


Re: spark on yarn can't load kafka dependency jar

2016-12-15 Thread Mich Talebzadeh
Try this, it should work, and yes, they are comma-separated:

spark-streaming-kafka_2.10-1.5.1.jar
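
For illustration, a sketch of the submit command with the jars comma-separated
(the class name and the HDFS location of the second jar are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar,hdfs://zzz:8020/jars/spark-streaming-kafka_2.10-1.5.1.jar \
  --class com.example.StreamingApp \
  /opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar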

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 December 2016 at 22:49, neil90 <neilp1...@icloud.com> wrote:

> Don't the jars need to be comma sperated when you pass?
>
> i.e. --jars "hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar",
> /opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/spark-on-yarn-can-t-load-kafka-
> dependency-jar-tp28216p28220.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: spark on yarn can't load kafka dependency jar

2016-12-15 Thread neil90
Don't the jars need to be comma-separated when you pass them?

i.e. --jars "hdfs://zzz:8020/jars/kafka_2.10-0.8.2.2.jar",
/opt/bigdevProject/sparkStreaming_jar4/sparkStreaming.jar 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-on-yarn-can-t-load-kafka-dependency-jar-tp28216p28220.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Can i display message on console when use spark on yarn?

2016-10-20 Thread ayan guha
What exactly do you mean by Yarn console? We use spark-submit and it
generates exactly the same log as you mentioned on the driver console.

On Thu, Oct 20, 2016 at 8:21 PM, Jone Zhang <joyoungzh...@gmail.com> wrote:

> I submit spark with "spark-submit --master yarn-cluster --deploy-mode
> cluster"
> How can i display message on yarn console.
> I expect it to be like this:
>
> .
> 16/10/20 17:12:53 main INFO org.apache.spark.deploy.yarn.Client>SPK>
> Application report for application_1453970859007_481440 (state: RUNNING)
> 16/10/20 17:12:58 main INFO org.apache.spark.deploy.yarn.Client>SPK>
> Application report for application_1453970859007_481440 (state: RUNNING)
> 16/10/20 17:13:03 main INFO org.apache.spark.deploy.yarn.Client>SPK>
> Application report for application_1453970859007_481440 (state: RUNNING)
> 16/10/20 17:13:08 main INFO org.apache.spark.deploy.yarn.Client>SPK>
> Application report for application_1453970859007_481440 (state: RUNNING)
> 16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
> Application report for application_1453970859007_481440 (state: FINISHED)
> 16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: 10.51.215.100
>  ApplicationMaster RPC port: 0
>  queue: root.default
>  start time: 1476954698645
>  final status: SUCCEEDED
>  tracking URL: http://10.179.20.47:8080/proxy/application_
> 1453970859007_481440/history/application_1453970859007_481440/1
>  user: mqq
> ===Spark Task Result is ===
> ===some message want to display===
> 16/10/20 17:13:13 Thread-3 INFO org.apache.spark.util.ShutdownHookManager>SPK>
> Shutdown hook called
> 16/10/20 17:13:13 Thread-3 INFO org.apache.spark.util.ShutdownHookManager>SPK>
> Deleting directory /data/home/spark_tmp/spark-5b9f434b-5837-46e6-9625-
> c4656b86af9e
>
> Thanks.
>
>
>


-- 
Best Regards,
Ayan Guha


Can i display message on console when use spark on yarn?

2016-10-20 Thread Jone Zhang
I submit spark with "spark-submit --master yarn-cluster --deploy-mode
cluster"
How can I display a message on the yarn console?
I expect it to be like this:

.
16/10/20 17:12:53 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:12:58 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:03 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:08 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: FINISHED)
16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: 10.51.215.100
 ApplicationMaster RPC port: 0
 queue: root.default
 start time: 1476954698645
 final status: SUCCEEDED
 tracking URL:
http://10.179.20.47:8080/proxy/application_1453970859007_481440/history/application_1453970859007_481440/1
 user: mqq
===Spark Task Result is ===
===some message want to display===
16/10/20 17:13:13 Thread-3 INFO
org.apache.spark.util.ShutdownHookManager>SPK> Shutdown hook called
16/10/20 17:13:13 Thread-3 INFO
org.apache.spark.util.ShutdownHookManager>SPK> Deleting directory
/data/home/spark_tmp/spark-5b9f434b-5837-46e6-9625-c4656b86af9e

Thanks.


DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread shankinson
Hi,

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We
had written some new code using the Spark DataFrame/DataSet APIs but are
noticing incorrect results on a join after writing and then reading data to
Windows Azure Storage Blobs (The default HDFS location). I've been able to
duplicate the issue with the following snippet of code running on the
cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score:
Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1,
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+
which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+
However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row =>
(row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] =
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0
We've tried changing the output format to ORC instead of parquet, but we see
the same results. Running Spark 2.0 locally, not on a cluster, does not have
this issue. Also running spark in local mode on the master node of the
Hadoop cluster also works. Only when running on top of YARN do we see this
issue.

This also seems very similar to this issue:
https://issues.apache.org/jira/browse/SPARK-10896

Thoughts?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-Dataset-join-not-producing-correct-results-in-Spark-2-0-Yarn-tp27888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread Stephen Hankinson
Hi,

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We
had written some new code using the Spark DataFrame/DataSet APIs but are
noticing incorrect results on a join after writing and then reading data to
Windows Azure Storage Blobs (The default HDFS location). I've been able to
duplicate the issue with the following snippet of code running on the
cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1,
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2,
1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show

outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show

outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row
=> (row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] =
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0

We've tried changing the output format to ORC instead of parquet, but we
see the same results. Running Spark 2.0 locally, not on a cluster, does not
have this issue. Also running spark in local mode on the master node of the
Hadoop cluster also works. Only when running on top of YARN do we see this
issue.

This also seems very similar to this issue: https://issues.apache.
org/jira/browse/SPARK-10896
Thoughts?


*Stephen Hankinson*


Re: Spark on yarn enviroment var

2016-10-01 Thread Vadim Semenov
The question should be addressed to the oozie community.

As far as I remember, a spark action doesn't have support for env variables.

On Fri, Sep 30, 2016 at 8:11 PM, Saurabh Malviya (samalviy) <
samal...@cisco.com> wrote:

> Hi,
>
>
>
> I am running spark on yarn using oozie.
>
>
>
> When submit through command line using spark-submit spark is able to read
> env variable.  But while submit through oozie its not able toget env
> variable and don’t see driver log.
>
>
>
> Is there any way we specify env variable in oozie spark action.
>
>
>
> Saurabh
>


Spark on yarn enviroment var

2016-09-30 Thread Saurabh Malviya (samalviy)
Hi,

I am running spark on yarn using oozie.

When submitting through the command line using spark-submit, Spark is able to read the 
env variable. But when submitting through Oozie it is not able to get the env variable, 
and I don't see the driver log.

Is there any way to specify an env variable in an Oozie spark action?

Saurabh


Does Spark on YARN inherit or replace the Hadoop/YARN configs?

2016-08-30 Thread Everett Anderson
Hi,

I've had a bit of trouble getting Spark on YARN to work. When executing in
this mode and submitting from outside the cluster, one must set
HADOOP_CONF_DIR or YARN_CONF_DIR
<https://spark.apache.org/docs/latest/running-on-yarn.html>, from which
spark-submit can find the params it needs to locate and talk to the YARN
application manager.

However, Spark also packages up all the Hadoop+YARN config files, ships
them to the cluster, and then uses them there.

Does it only override settings on the cluster using those shipped files? Or
does it use those entirely instead of the config the cluster already has?

My impression is that it currently replaces rather than overrides, which
means you can't construct a minimal client-side Hadoop/YARN config with
only the properties necessary to find the cluster. Is that right?
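
For concreteness, a sketch of the kind of minimal client-side config in question,
with just enough to locate the cluster (the hostname is a placeholder, and whether
this alone is sufficient is exactly the open question):

<!-- yarn-site.xml on the submitting machine -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
</configuration>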


Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try to turn yarn.scheduler.capacity.resource-calculator on, then check again.
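
For reference, a sketch of what that looks like in capacity-scheduler.xml, which
is the switch to the dominant resource calculator described below:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>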

On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Use dominant resource calculator instead of default resource calculator will
> get the expected vcores as you wanted. Basically by default yarn does not
> honor cpu cores as resource, so you will always see vcore is 1 no matter
> what number of cores you set in spark.
>
> On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna
> <satyajit.apas...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I am trying to run a spark job using yarn, and i specify --executor-cores
>> value as 20.
>> But when i go check the "nodes of the cluster" page in
>> http://hostname:8088/cluster/nodes then i see 4 containers getting created
>> on each of the node in cluster.
>>
>> But can only see 1 vcore getting assigned for each containier, even when i
>> specify --executor-cores 20 while submitting job using spark-submit.
>>
>> yarn-site.xml
>> 
>> yarn.scheduler.maximum-allocation-mb
>> 6
>> 
>> 
>> yarn.scheduler.minimum-allocation-vcores
>> 1
>> 
>> 
>> yarn.scheduler.maximum-allocation-vcores
>> 40
>> 
>> 
>> yarn.nodemanager.resource.memory-mb
>> 7
>> 
>> 
>> yarn.nodemanager.resource.cpu-vcores
>> 20
>> 
>>
>>
>> Did anyone face the same issue??
>>
>> Regards,
>> Satyajit.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try to turn "yarn.scheduler.capacity.resource-calculator" on

On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
> Use dominant resource calculator instead of default resource calculator will
> get the expected vcores as you wanted. Basically by default yarn does not
> honor cpu cores as resource, so you will always see vcore is 1 no matter
> what number of cores you set in spark.
>
> On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna
> <satyajit.apas...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I am trying to run a spark job using yarn, and i specify --executor-cores
>> value as 20.
>> But when i go check the "nodes of the cluster" page in
>> http://hostname:8088/cluster/nodes then i see 4 containers getting created
>> on each of the node in cluster.
>>
>> But can only see 1 vcore getting assigned for each containier, even when i
>> specify --executor-cores 20 while submitting job using spark-submit.
>>
>> yarn-site.xml
>> 
>> yarn.scheduler.maximum-allocation-mb
>> 6
>> 
>> 
>> yarn.scheduler.minimum-allocation-vcores
>> 1
>> 
>> 
>> yarn.scheduler.maximum-allocation-vcores
>> 40
>> 
>> 
>> yarn.nodemanager.resource.memory-mb
>> 7
>> 
>> 
>> yarn.nodemanager.resource.cpu-vcores
>> 20
>> 
>>
>>
>> Did anyone face the same issue??
>>
>> Regards,
>> Satyajit.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


