Unsubscribe
Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?
Not as far as I recall...

From: Serega Sheypak
Sent: Friday, January 18, 2019 3:21 PM
To: user
Subject: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

Hi, is there any possibility to tell the scheduler to blacklist specific nodes in advance?
Spark on Yarn, is it possible to manually blacklist nodes before running spark job?
Hi, is there any possibility to tell the scheduler to blacklist specific nodes in advance?
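As a hedged aside to this thread: later Spark releases document a spark.yarn.exclude.nodes property for keeping specific YARN hosts out of allocation up front (it was not part of the release under discussion). The sketch below assumes such a release; the host names are invented for illustration, and in practice the value would usually be passed as --conf spark.yarn.exclude.nodes=... on spark-submit rather than set in code.

// Hedged sketch: assumes a Spark release that supports spark.yarn.exclude.nodes;
// the host names are hypothetical.
import org.apache.spark.sql.SparkSession

object ExcludeNodesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("exclude-nodes-sketch")
      .config("spark.yarn.exclude.nodes", "badnode1.example.com,badnode2.example.com")
      .getOrCreate()

    // Run the job as usual; YARN containers should not be requested on the excluded hosts.
    spark.stop()
  }
}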
RDD pipe subprocess exit code
When using rdd.pipe(script), I get the following error: "java.lang.IllegalStateException: Subprocess exited with status 132. Command ran: "./script -h"". I'm getting this while trying to run my external script with a simple "-h" argument, to test that it runs smoothly through my Spark code. When I run it as I ultimately intend to, i.e. with many more flags, I get the same error but with exit status 1 instead of 132. The stack trace makes no mention of the error that actually happened in the command, and checking the executor logs (via yarn logs -applicationId ) only shows what is already in the stack trace. Also note that the app runs correctly in standalone mode. Does anyone have a suggested course I should follow to solve this?
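For reference, exit codes above 128 usually follow the shell's 128+signal convention (132 = 128 + SIGILL), which often points at the external binary itself rather than at Spark. Below is a minimal sketch of the rdd.pipe call under discussion; the script name, input data, and paths are assumptions. Wrapping the command in a shell and redirecting stderr to stdout makes the script's own error text visible in the piped output and executor logs, instead of only the generic exit-status exception.

// Minimal sketch (script name and data are assumptions from the message above).
import org.apache.spark.sql.SparkSession

object PipeDebugSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipe-debug").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val input = sc.parallelize(Seq("line1", "line2"))

    // pipe() feeds each partition's lines to the command's stdin and returns its stdout.
    // "2>&1" surfaces the script's stderr alongside its stdout.
    val piped = input.pipe(Seq("/bin/sh", "-c", "./script -h 2>&1"))

    piped.collect().foreach(println)
    spark.stop()
  }
}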
Re: Question about RDD pipe
Thanks a lot for the answer! It solved my problem.
Re: dataset best practice question
Thanks! I wanted to avoid repeating f1, f2, f3 in class B. I wonder whether the encoders/decoders work if I use mixins.

On Tue, Jan 15, 2019 at 7:57 PM wrote:

> Hi Mohit,
>
> I'm not sure that there is a "correct" answer here, but I tend to use classes whenever the input or output data represents something meaningful (such as a domain model object). I would recommend against creating many temporary classes for each and every transformation step, as that may be difficult to maintain over time.
>
> Using withColumn statements will continue to work, and you don't need to cast to your output class until you've set up all transformations. Therefore, you can do things like:
>
> case class A (f1, f2, f3)
> case class B (f1, f2, f3, f4, f5, f6)
>
> ds_a = spark.read.csv("path").as[A]
> ds_b = ds_a
>   .withColumn("f4", someUdf)
>   .withColumn("f5", someUdf)
>   .withColumn("f6", someUdf)
>   .as[B]
>
> Kevin
>
> From: Mohit Jaggi
> Sent: Tuesday, January 15, 2019 1:31 PM
> To: user
> Subject: dataset best practice question
>
> Fellow Spark Coders,
>
> I am trying to move from using Dataframes to Datasets for a reasonably large code base. Today the code looks like this:
>
> df_a = read_csv
> df_b = df_a.withColumn( some_transform_that_adds_more_columns )
> //repeat the above several times
>
> With datasets, this will require defining
>
> case class A { f1, f2, f3 } //fields from csv file
> case class B { f1, f2, f3, f4 } //union of A and new field added by some_transform_that_adds_more_columns
> //repeat this 10 times
>
> Is there a better way?
>
> Mohit.
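A self-contained Scala sketch of the pattern described in the reply above; the field types, the header option, the lit(...) expressions standing in for someUdf, and the file path are all assumptions, since the thread does not specify them.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Input shape read from CSV, and the widened output shape.
case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String, f4: Int, f5: Int, f6: Int)

object DatasetCastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-cast-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read the CSV as a typed Dataset of the input shape.
    val dsA = spark.read.option("header", "true").csv("path/to/input.csv").as[A]

    // Chain the untyped column additions, then cast to the output shape once at the end.
    val dsB = dsA
      .withColumn("f4", lit(1))
      .withColumn("f5", lit(2))
      .withColumn("f6", lit(3))
      .as[B]

    dsB.show()
    spark.stop()
  }
}

Defining only the input and output case classes, and leaving the intermediate steps untyped, avoids a temporary class per transformation; whether the same encoders resolve fields pulled in from mixin traits is easy to test with a similarly small example.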
[SPARK ON K8]: How do you configure executors to use the keytab inside their image on Kubernetes?
Hi,

I'm attempting to use Spark on Kubernetes to connect to a Kerberized Hadoop cluster. While I'm able to successfully connect to the company's Hive tables and run queries on them, I've only managed to do this on a single driver pod (with no executors). If I use any executor pods, the process fails because the executors do not authenticate themselves with the keytab and return a SIMPLE authentication error instead. This is surprising because the executors use the same image as the driver and should therefore have the keytab and XML config files inside them. The driver is able to authenticate itself with the keytab because it is running the target JAR, which instructs it to do so. I can see that the executors are not running processes from the JAR, but are instead running tasks that have been delegated by the driver.

Please have a look at my Stack Overflow question, which contains all the details: https://stackoverflow.com/questions/54181560/when-running-spark-on-kubernetes-to-access-kerberized-hadoop-cluster-how-do-you

My main references while trying to implement this architecture have been the following:
- https://github.com/apache/spark/blob/master/docs/security.md
- https://www.slideshare.net/Hadoop_Summit/running-secured-spark-job-in-kubernetes-compute-cluster-and-integrating-with-kerberized-hdfs
- https://www.iteblog.com/sparksummit2018/apache-spark-on-k8s-and-hdfs-security-with-ilan-flonenko-iteblog.pdf

Initially I attempted option 2 in the first link, but it just failed with the error above. I've also tried following the second and third links: I attempted to pass the keytab as a secret via one of the config parameters in the spark-submit job (as described here: https://spark.apache.org/docs/latest/running-on-kubernetes.html), but unfortunately this also returns the same error.

I would be grateful for any advice you can offer.

Thank you,
Karan
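Not an answer from the thread, just one hedged illustration: since the keytab is already baked into the executor image, each executor JVM can in principle be made to log in from it explicitly before touching HDFS/Hive. The principal name and keytab path below are invented, and whether this is appropriate (versus Spark's own keytab/principal submit options and delegation-token handling) depends on the setup.

// Hypothetical sketch only: the principal and keytab path are assumptions, not from the thread.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object ExecutorKerberosLogin {
  private var loggedIn = false

  // Call this inside a task (e.g. at the top of mapPartitions) so it runs in every executor JVM.
  def ensureLogin(): Unit = synchronized {
    if (!loggedIn) {
      val conf = new Configuration()
      conf.set("hadoop.security.authentication", "kerberos")
      UserGroupInformation.setConfiguration(conf)
      UserGroupInformation.loginUserFromKeytab(
        "svc-spark@EXAMPLE.COM",               // hypothetical principal
        "/etc/security/keytabs/spark.keytab")  // hypothetical path inside the image
      loggedIn = true
    }
  }
}

// Example use: rdd.mapPartitions { it => ExecutorKerberosLogin.ensureLogin(); it }.count()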