Connect to two different HDFS servers with different usernames
Is there any way to get data from HDFS (e.g. with sc.textFile) with two separate usernames in the same Spark job? For instance, if I have a file on hdfs-server-1.com that the alice user has permission to view, and a file on hdfs-server-2.com that the bob user has permission to view, I'd like to be able to read both in the same job. Is there any way to do something like this? Or can Spark only connect to HDFS with the same username that it's running as?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connect-to-two-different-HDFS-servers-with-different-usernames-tp26143.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
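One possible approach on clusters using simple (non-Kerberos) authentication — a sketch of mine, not something confirmed in this thread — is to wrap each read in UserGroupInformation.doAs for a different remote user. The hostnames and paths are placeholders, and sc is assumed to be an in-scope SparkContext:

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.rdd.RDD

// Sketch: issue each read as a different remote user (simple auth only).
// Paths and hostnames are placeholders; sc is an in-scope SparkContext.
def readAs(user: String, path: String): RDD[String] = {
  val ugi = UserGroupInformation.createRemoteUser(user)
  ugi.doAs(new PrivilegedExceptionAction[RDD[String]] {
    override def run(): RDD[String] = sc.textFile(path)
  })
}

val rdd1 = readAs("alice", "hdfs://hdfs-server-1.com/path/to/file1")
val rdd2 = readAs("bob", "hdfs://hdfs-server-2.com/path/to/file2")
```

One caveat: textFile is lazy, so the doAs here covers only driver-side work such as listing the input; the tasks that actually open the blocks run later on executors and would need the same wrapping. Treat this as a starting point rather than a complete answer.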
Re: newAPIHadoopFile uses AWS credentials from other threads
Hmmm, I seem to be able to get around this by setting hadoopConf1.setBoolean("fs.s3n.impl.disable.cache", true) in my code. Is there anybody more familiar with Hadoop who can confirm that the filesystem cache would cause this issue?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/newAPIHadoopFile-uses-AWS-credentials-from-other-threads-tp26081p26082.html
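That matches how the Hadoop FileSystem cache behaves, as far as I understand it: FileSystem.get caches instances keyed by scheme, authority, and user — not by the credentials inside the Configuration — so a second thread can be handed an S3 FileSystem built with the first thread's keys. A sketch of the per-thread setup (key variables and bucket names are placeholders, and sc is an in-scope SparkContext):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Per-thread configuration carrying this thread's own AWS credentials.
val hadoopConf1 = new Configuration()
hadoopConf1.set("fs.s3n.awsAccessKeyId", accessKey1)     // placeholder
hadoopConf1.set("fs.s3n.awsSecretAccessKey", secretKey1) // placeholder
// Bypass the shared FileSystem cache so these credentials are actually used
// instead of whatever instance another thread already created.
hadoopConf1.setBoolean("fs.s3n.impl.disable.cache", true)

val rdd1 = sc.newAPIHadoopFile(
  "s3n://bucket-1/path",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  hadoopConf1)
```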
Re: Exceptions in threads in executor code don't get caught properly
Sorry, I guess my code and the exception didn't make it to the mailing list. Here's my code:

def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("Test app")
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(Array(1, 2, 3))
  val rdd1 = rdd.map({ x =>
    throw new Exception("Test exception 12345")
    x
  })
  rdd1.foreachPartition(part => {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        for (row <- part) {
          Console.println(s"Got $row")
        }
      }
    })
    t.start()
    t.join()
  })
}

And here's the exception:

15/08/31 19:42:20 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Thread-7,5,main]
java.lang.Exception: Test exception 12345
        at TestApp$anonfun$1.apply$mcII$sp(TestApp.scala:15)
        at TestApp$anonfun$1.apply(TestApp.scala:14)
        at TestApp$anonfun$1.apply(TestApp.scala:14)
        at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at TestApp$anonfun$main$1$anon$1.run(TestApp.scala:21)
        at java.lang.Thread.run(Thread.java:745)

Anyways, I realized that this occurs because of the SparkUncaughtExceptionHandler, which executors set as their threads' default uncaught exception handler. Any exceptions that happen in the main task thread get caught here:

https://github.com/apache/spark/blob/69c9c177160e32a2fbc9b36ecc52156077fca6fc/core/src/main/scala/org/apache/spark/executor/Executor.scala#L294

and thus get logged to task metrics. The SparkUncaughtExceptionHandler, however, is what catches exceptions thrown in subthreads, so those never get logged to task metrics (and thus never appear in the web UI).
My fix was to catch exceptions within my threads, like this:

rdd1.foreachPartition(part => {
  var exception: Throwable = null
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      try {
        for (row <- part) {
          Console.println(s"Got $row")
        }
      } catch {
        case e: Exception => exception = e
      }
    }
  })
  t.start()
  t.join()
  if (exception != null) {
    throw exception
  }
})

It'd be nice if the uncaught exception handler in executors logged exceptions to task metrics, but I'm not sure how feasible that'd be.

On Thu, Sep 3, 2015 at 12:26 AM, Akhil Das wrote:

> I'm not able to find the piece of code that you wrote, but you can use a
> try...catch to catch your user specific exceptions and log it in the logs.
>
> Something like this:
>
> myRdd.map(x => try{ //something }catch{ case e:Exception =>
> log.error("Whoops!! :" + e) })
>
> Thanks
> Best Regards
>
> On Tue, Sep 1, 2015 at 1:22 AM, Wayne Song wrote:
>
>> We've been running into a situation where exceptions in rdd.map() calls
>> will not get recorded and shown on the web UI properly. We've discovered
>> that this seems to occur because we're creating our own threads in
>> foreachPartition() calls. If I have code like this:
>>
>> The tasks on the executors will fail because rdd1 will raise an exception
>> for each record as we iterate across the "part" iterator inside the thread
>> in the foreachPartition call. Usually, exceptions in Spark apps show up in
>> the web UI on the application detail page, making problems easy to debug.
>> However, if any exceptions get raised inside of these user threads, they
>> don't show up in the web UI (instead, it just says that the executor was
>> lost), and in the executor logs, we see errors like:
>>
>> What's going on here? Why are these exceptions not caught? And is there a
>> way to have user threads register their exceptions properly?
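A variant of the same fix — a sketch along the same lines, not something posted in the thread — is to install an uncaught exception handler on the user thread and rethrow on the task thread after join(), so the failure reaches the executor's normal error path even if the thread body has no try/catch of its own:

```scala
import java.util.concurrent.atomic.AtomicReference

// Assumes rdd1 as defined earlier in this thread.
rdd1.foreachPartition(part => {
  val captured = new AtomicReference[Throwable](null)
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      for (row <- part) {
        Console.println(s"Got $row")
      }
    }
  })
  // Capture anything the thread does not handle itself, before Spark's
  // default SparkUncaughtExceptionHandler sees it.
  t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
    override def uncaughtException(th: Thread, e: Throwable): Unit =
      captured.set(e)
  })
  t.start()
  t.join()
  // Rethrow on the task thread so the failure lands in task metrics.
  if (captured.get() != null) throw captured.get()
})
```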
Exceptions in threads in executor code don't get caught properly
We've been running into a situation where exceptions in rdd.map() calls will not get recorded and shown on the web UI properly. We've discovered that this seems to occur because we're creating our own threads in foreachPartition() calls. If I have code like this:

The tasks on the executors will fail because rdd1 will raise an exception for each record as we iterate across the "part" iterator inside the thread in the foreachPartition call. Usually, exceptions in Spark apps show up in the web UI on the application detail page, making problems easy to debug. However, if any exceptions get raised inside of these user threads, they don't show up in the web UI (instead, it just says that the executor was lost), and in the executor logs, we see errors like:

What's going on here? Why are these exceptions not caught? And is there a way to have user threads register their exceptions properly?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Exceptions-in-threads-in-executor-code-don-t-get-caught-properly-tp24525.html
Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP
I made this message with the Nabble web interface; I included the stack trace there, but I guess it didn't show up in the emails. Anyways, here's the stack trace:

15/07/27 17:04:09 ERROR NettyTransport: failed to bind to /54.xx.xx.xx:7093, shutting down Netty transport
Exception in thread "main" java.net.BindException: Failed to bind to: /54.xx.xx.xx:7093: Service 'sparkMaster' failed after 16 retries!
        at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
        at akka.remote.transport.netty.NettyTransport$anonfun$listen$1.apply(NettyTransport.scala:393)
        at akka.remote.transport.netty.NettyTransport$anonfun$listen$1.apply(NettyTransport.scala:389)
        at scala.util.Success$anonfun$map$1.apply(Try.scala:206)
        at scala.util.Try$.apply(Try.scala:161)
        at scala.util.Success.map(Try.scala:206)
        at scala.concurrent.Future$anonfun$map$1.apply(Future.scala:235)
        at scala.concurrent.Future$anonfun$map$1.apply(Future.scala:235)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
        at akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
        at akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at akka.dispatch.BatchingExecutor$Batch$anonfun$run$1.apply(BatchingExecutor.scala:59)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I'm using Spark 1.4.0. Binding to 0.0.0.0 works, but then workers can't connect to the Spark master, because when you start a worker, you have to give it the Spark master URL in the form spark://<host>:7077. My understanding is that because of Akka, you have to bind to the exact hostname that you used when you started the Spark master; thus, you can't bind to 0.0.0.0 on the Spark master machine and then connect to spark://54.xx.xx.xx:7077 or whatever.

On Tue, Jul 28, 2015 at 6:15 AM, Ted Yu wrote:

> Can you show the full stack trace ?
>
> Which Spark release are you using ?
>
> Thanks
>
> On Jul 27, 2015, at 10:07 AM, Wayne Song wrote:
>
> > Hello,
> >
> > I am trying to start a Spark master for a standalone cluster on an EC2
> > node. The CLI command I'm using looks like this:
> >
> > Note that I'm specifying the --host argument; I want my Spark master to
> > be listening on a specific IP address. The host that I'm specifying (i.e.
> > 54.xx.xx.xx) is the public IP for my EC2 node; I've confirmed that
> > nothing else is listening on port 7077 and that my EC2 security group
> > has all ports open. I've also double-checked that the public IP is
> > correct.
> >
> > When I use --host 54.xx.xx.xx, I get the following error message:
> >
> > This does not occur if I leave out the --host argument and it doesn't
> > occur if I use --host 10.0.xx.xx, where 10.0.xx.xx is my private EC2 IP
> > address.
> >
> > Why would Spark fail to bind to a public EC2 address?
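The likely root cause (my reading, not confirmed in the thread) is that an EC2 public IP is not bound to any interface on the instance — it is NATed to the private address — so a listener can only bind the private IP or 0.0.0.0. In the standalone scripts this is typically expressed in conf/spark-env.sh; the addresses below are placeholders:

```shell
# conf/spark-env.sh on the master node (sketch; addresses are placeholders)
# Bind to the instance's private address -- the public 54.xx.xx.xx is NATed
# and does not exist on any local interface, hence the BindException.
SPARK_MASTER_IP=10.0.xx.xx
```

Workers would then connect with the same address, e.g. spark://10.0.xx.xx:7077, which with Akka-based Spark (1.x) must match the master's bind address exactly.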
Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP
Hello,

I am trying to start a Spark master for a standalone cluster on an EC2 node. The CLI command I'm using looks like this:

Note that I'm specifying the --host argument; I want my Spark master to be listening on a specific IP address. The host that I'm specifying (i.e. 54.xx.xx.xx) is the public IP for my EC2 node; I've confirmed that nothing else is listening on port 7077 and that my EC2 security group has all ports open. I've also double-checked that the public IP is correct.

When I use --host 54.xx.xx.xx, I get the following error message:

This does not occur if I leave out the --host argument and it doesn't occur if I use --host 10.0.xx.xx, where 10.0.xx.xx is my private EC2 IP address.

Why would Spark fail to bind to a public EC2 address?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-java-net-BindException-when-attempting-to-start-Spark-master-on-EC2-node-with-public-IP-tp24011.html
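For reference, an invocation of the kind described above (a reconstruction with placeholder values, since the original command did not survive the archive) might look like:

```shell
# Sketch: start a standalone master bound to an explicit address and port.
./bin/spark-class org.apache.spark.deploy.master.Master \
  --host 54.xx.xx.xx --port 7077 --webui-port 8080
```

On EC2 this bind fails because the public address is NATed rather than assigned to a local interface; substituting the private 10.0.xx.xx address, as noted above, allows the bind to succeed.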