FYI: Prof John Canny is giving a talk on Machine Learning at the limit in SF Big Analytics Meetup
Just in case you are in San Francisco: we are having a meetup with Prof John Canny. http://www.meetup.com/SF-Big-Analytics/events/220427049/ Chester
Re: Spark impersonation
Sorry for the many typos, as I was typing from my cell phone. I hope you can still get the idea.

On Sat, Feb 7, 2015 at 1:55 PM, Chester @work ches...@alpinenow.com wrote:

I just implemented this in our application. The impersonation is done before the job is submitted. In Spark on YARN (we are using yarn-cluster mode), Spark just takes the current user from UserGroupInformation and submits to the YARN resource manager.

If one uses kinit from the command line, the whole JVM needs to have the same principal, and you have to handle ticket expiration with a cron job. If this is an individual ad hoc CLI job, that might be OK. But if you intend for an application to run Spark jobs while end users interact with Spark, then you need to set up a service super user, use that user to log in to the Kerberos KDC (the kinit equivalent) programmatically, and then create a proxy user to impersonate the end user. You can handle ticket expiration in code as well, so there is no need for a cron job.

Certainly one could move all this logic into Spark; one would need to create a Spark service user principal and keytab. As part of the Spark job submit, one could pass the principal and keytab location to Spark, and Spark could create a proxy user if the authentication is Kerberos, as well as add the job delegation tokens. I would love to contribute this if we need it in Spark, as I just completed the Hadoop Kerberos authentication feature; it covers Pig, MapReduce, Spark, and Sqoop, as well as standard HDFS access. I will take a look at Sandy's JIRA.

Chester

On Feb 2, 2015, at 2:37 PM, Jim Green openkbi...@gmail.com wrote:

Hi Team, does Spark support impersonation? For example, when Spark runs on YARN/Hive/HBase/etc., which user is used by default? The user which starts the Spark job? Any suggestions related to impersonation? -- Thanks, www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
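For reference, here is a minimal sketch of the login/proxy-user pattern described above, using Hadoop's UserGroupInformation API. The principal, keytab path, and end-user name are hypothetical placeholders, and the proxy user also needs the usual hadoop.proxyuser.* settings in core-site.xml:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    // log the service super user in to the Kerberos KDC programmatically
    // (the kinit equivalent); principal and keytab path are placeholders
    UserGroupInformation.loginUserFromKeytab(
      "svc-spark@EXAMPLE.COM", "/etc/security/keytabs/svc-spark.keytab")

    // create a proxy user to impersonate the end user
    val proxyUgi = UserGroupInformation.createProxyUser(
      "endUser", UserGroupInformation.getLoginUser)

    // run the job submission as the end user
    proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // submit the Spark-on-YARN job here; it runs as "endUser"
      }
    })

    // handle ticket expiration in code instead of a cron job
    UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()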
Re: Possible bug in ClientBase.scala?
Ron, which distribution and version of Hadoop are you using? I just looked at CDH5 (hadoop-mapreduce-client-core-2.3.0-cdh5.0.0), and MRJobConfig does have the field: java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH

Chester

On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:

Hi, I was doing programmatic submission of Spark YARN jobs and I saw this code in ClientBase.getDefaultYarnApplicationClasspath(): val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH") MRJobConfig doesn't have this field, so the created launch env is incomplete. The workaround is to set yarn.application.classpath to the value from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH. The incomplete env results in the Spark job hanging if the submission config is different from the default config. For example, if my resource manager port is 8050 instead of 8030, then the Spark app is not able to register itself and stays in the ACCEPTED state. I can easily fix this by changing this to YarnConfiguration instead of MRJobConfig, but I was wondering what the steps are for submitting a fix. Thanks, Ron

Sent from my iPhone
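For context, a hedged sketch of the change Ron suggests (keeping the reflective lookup, but against YarnConfiguration rather than MRJobConfig); this is an illustration, not the actual patch:

    import org.apache.hadoop.yarn.conf.YarnConfiguration

    // look up the default YARN application classpath reflectively, as
    // ClientBase does, but on YarnConfiguration, which defines the field
    def getDefaultYarnApplicationClasspath: Array[String] = {
      val field = classOf[YarnConfiguration]
        .getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
      field.get(null).asInstanceOf[Array[String]]
    }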
Re: spark-assembly libraries conflict with needed libraries
I don't have experience deploying to EC2. Can you use the add.jar conf to add the missing jar at runtime? I haven't tried this myself; just a guess.

On Mon, Jul 7, 2014 at 12:16 PM, Chester Chen ches...@alpinenow.com wrote: With "provided" scope, you need to supply the provided jars at runtime yourself, in this case I guess the Hadoop jar files.

On Mon, Jul 7, 2014 at 12:13 PM, Robert James srobertja...@gmail.com wrote: Thanks - that did solve my error, but instead I got a different one: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat It seems that with that setting, Spark can't find Hadoop.

On 7/7/14, Koert Kuipers ko...@tresata.com wrote: Spark has a setting to put user jars in front of the classpath, which should do the trick. However, I had no luck with this. See here: https://issues.apache.org/jira/browse/SPARK-1863

On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote: spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use spark-submit, we get NoSuchMethodErrors, even though the code compiles fine, because the runtime classes are different from the compile-time classes! Can someone recommend a solution? We are using Scala, sbt, and sbt-assembly, but are happy to use another tool (please provide instructions).
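A minimal build.sbt sketch of the "provided" approach discussed above (the artifact versions are assumptions; match them to your cluster):

    // mark Spark and Hadoop as "provided" so sbt-assembly leaves them out of
    // the application uber jar; spark-submit supplies them at runtime
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"    % "1.0.0" % "provided",
      "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"
    )

The trade-off is the NoClassDefFoundError Robert hit: anything marked "provided" must still be on the classpath when the application runs outside spark-submit (for example, in local tests).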
Re: Issues in opening UI when running Spark Streaming in YARN
@Andrew Yes, the link points to the same redirect, http://localhost/proxy/application_1404443455764_0010/. I suspect it is something to do with the cluster setup. I will let you know once I find something. Chester

On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or and...@databricks.com wrote:

@Yan, the UI should still work. As long as you look into the container that launches the driver, you will find the SparkUI address and port. Note that in yarn-cluster mode the Spark driver doesn't actually run in the Application Manager; just like the executors, it runs in a container that is launched by the Resource Manager after the Application Master requests the container resources. In contrast, in yarn-client mode your driver is not launched in a container but in the client process that launched your application (i.e. spark-submit), so the stdout of this program directly contains the SparkUI messages.

@Chester, I'm not sure what has gone wrong, as there are many factors at play here. When you go to the Resource Manager UI, does the application URL link point you to the same SparkUI address as indicated in the logs? If so, this is the correct behavior. However, I believe the redirect error has little to do with Spark itself and more to do with how you set up the cluster. I have actually run into this myself, but I haven't found a workaround. Let me know if you find anything.

2014-07-07 12:07 GMT-07:00 Chester Chen ches...@alpinenow.com:

As Andrew explained, the port is random rather than 4040, as the Spark driver is started in the Application Master and the port is randomly selected. But I have a similar UI issue. I am running yarn-cluster mode against my local CDH5 cluster. The log states 14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at http://10.0.0.63:58750 but when I click the Spark UI link (ApplicationMaster or http://10.0.0.63:58750), I get a 404 with the redirect URI http://localhost/proxy/application_1404443455764_0010/ Looking at the Spark code, I notice that the proxy is really a variable that gets the proxy from the yarn-site.xml HTTP address. But even when I specified the value in yarn-site.xml, it still doesn't work for me. Oddly enough, it works for my co-worker on a Pivotal HD cluster, so I am still looking at what's different in terms of cluster setup or something else. Chester

On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or and...@databricks.com wrote:

I will assume that you are running in yarn-cluster mode. Because the driver is launched in one of the containers, it doesn't make sense to expose port 4040 for the node that contains the container. (Imagine if multiple driver containers are launched on the same node; this would cause a port collision.) If you're launching Spark from a gateway node that is physically near your worker nodes, then you can just launch your application in yarn-client mode, in which case the SparkUI will always be started on port 4040 on the node that you ran spark-submit on. The reason you sometimes see the red text is that it appears only in the driver containers, not the executor containers. This is because the SparkUI belongs to the SparkContext, which only exists on the driver. Andrew

2014-07-07 11:20 GMT-07:00 Yan Fang yanfang...@gmail.com:

Hi guys, not sure if you have similar issues. I did not find relevant tickets in JIRA. When I deploy Spark Streaming to YARN, I have the following two issues: 1. The UI port is random; it is not the default 4040. I have to look at the container's log to check the UI port. Is it supposed to be this way? 2.
Most of the time, the UI does not work. The difference between the logs (I ran the same program) is:

14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server
14/07/03 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/03 11:38:50 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
14/07/03 11:38:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
14/07/03 11:38:51 INFO executor.Executor: Running task ID 0
...

versus:

14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/02 16:55:32 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14211
14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/02 16:55:32 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
14/07/02 16:55:32 INFO ui.SparkUI: Started SparkUI at http://myNodeName:21867
14/07/02 16:55:32 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler

When the highlighted part (the AmIpFilter and SparkUI lines in the second run) appears, the UI sometimes works. Any ideas? Thank you. Best, Fang, Yan yanfang...@gmail.com +1 (206) 849
Re: Shark Vs Spark SQL
Yes, it was announced at Spark Summit 2014 that Shark is no longer under development and will be replaced by Spark SQL. Chester

On Wed, Jul 2, 2014 at 3:53 PM, Subacini B subac...@gmail.com wrote:

Hi, http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E This thread says the Shark backend will be replaced with the Spark SQL engine in the future. Does that mean Spark will continue to support Shark + Spark SQL for the long term, or will Shark be decommissioned after some period? Thanks, Subacini
Re: Is spark context in local mode thread-safe?
Matei, if we use different Akka actors to process different users' requests (not different threads), is the SparkContext still safe to use for different users?

Yes, it would be nice to disable the UI via configuration, especially when we develop locally. We use the sbt-web plugin to debug Tomcat code; if we could disable the UI HTTP server, it would be much simpler than having two HTTP containers to deal with. Chester

On Monday, June 9, 2014 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

You currently can't have multiple SparkContext objects in the same JVM, but within a SparkContext, all of the APIs are thread-safe, so you can share that context between multiple threads. The other issue you'll run into is that in each thread where you want to use Spark, you need to use SparkEnv.set(env), where "env" was obtained by SparkEnv.get in the thread that created the context. This requirement will hopefully go away soon. Unfortunately there's no way yet to disable the UI — feel free to open a JIRA for it, it shouldn't be hard to do. Matei

On Jun 9, 2014, at 3:50 PM, DB Tsai dbt...@stanford.edu wrote:

Hi guys, we would like to use the Spark Hadoop API to get the first couple hundred lines at design time, to quickly show users the file structure/metadata and the values in those lines, without launching the full Spark job in the cluster. Since we're a web-based application, there will be multiple users using the Spark Hadoop API, for example sc.textFile(filePath). I wonder if those APIs are thread-safe in local mode (each user will have their own SparkContext object). Secondly, it seems that even in local mode, the Jetty UI tracker will be launched. For this kind of cheap operation, having a Jetty UI tracker for each operation is very expensive. Is there a way to disable this behavior? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
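A minimal sketch of the thread-sharing pattern Matei describes, for the Spark version current at the time of this thread (the file path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shared-context"))
    val env = SparkEnv.get // obtained in the thread that created the context

    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        SparkEnv.set(env) // required in each thread that uses Spark
        // cheap "preview" read: take a few hundred lines without a full job
        val head = sc.textFile("/path/to/file").take(200)
        println(head.length)
      }
    })
    worker.start()
    worker.join()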
Re: Is spark context in local mode thread-safe?
Matei, thanks for the insight; we have to consider our design carefully. We are in the process of moving our system to Akka, and it would be nice to use Akka all the way, but I understand the limitations. Thanks, Chester

On Monday, June 9, 2014 5:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

In general you probably shouldn't use actors for processing requests, because Spark operations are blocking and Akka only has a limited thread pool for each ActorSystem. You risk blocking all the threads with ongoing requests and not being able to service new ones. That said, you can configure Akka to spawn more threads, and in that case it would probably be okay. See http://doc.akka.io/docs/akka/snapshot/java/dispatchers.html for some details on Akka thread usage and how to configure it. Matei

On Jun 9, 2014, at 4:54 PM, Chester Chen chesterxgc...@yahoo.com wrote:

Matei, if we use different Akka actors to process different users' requests (not different threads), is the SparkContext still safe to use for different users?
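A sketch of the dispatcher configuration Matei points to, under the assumption of an Akka 2.x application: give the Spark-calling actors their own thread pool so blocking calls don't starve the default dispatcher (the dispatcher id and pool size are placeholders):

    // in application.conf (hypothetical dispatcher id):
    //   spark-blocking-dispatcher {
    //     type = Dispatcher
    //     executor = "thread-pool-executor"
    //     thread-pool-executor { fixed-pool-size = 16 }
    //   }
    import akka.actor.{Actor, ActorSystem, Props}

    class SparkJobActor extends Actor {
      def receive = {
        case path: String =>
          // a blocking Spark call would run here, on the dedicated pool
          sender ! s"processed $path"
      }
    }

    val system = ActorSystem("app")
    val sparkWorker = system.actorOf(
      Props[SparkJobActor].withDispatcher("spark-blocking-dispatcher"),
      name = "spark-worker")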
[ANN]: Scala By The Bay Conference ( aka Silicon Valley Scala Symposium)
Hi, this is not related to Spark, but I thought you might be interested: the second SF Scala conference is coming this August. The conference was called the Silicon Valley Scala Symposium last year; from now on it will be known as Scala By The Bay. http://www.scalabythebay.org -- watch that space for announcements and the CFP! Chester
Is the branch-1.0 build broken?
I just updated and got the following:

[error] (external-mqtt/*:update) sbt.ResolveException: unresolved dependency: org.eclipse.paho#mqtt-client;0.4.0: not found
[error] Total time: 7 s, completed Apr 10, 2014 4:27:09 PM

Chesters-MacBook-Pro:spark chester$ git branch
* branch-1.0
  master

It looks like the resolver for the mqtt-client dependency is not specified. Chester
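If that is the cause, a hedged sbt sketch of the fix (the Eclipse Paho repository URL below is an assumption, not taken from the Spark build):

    // add the Eclipse Paho release repository so that
    // org.eclipse.paho#mqtt-client;0.4.0 can be resolved
    resolvers += "Eclipse Paho Releases" at
      "https://repo.eclipse.org/content/repositories/paho-releases/"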