FYI: Prof John Canny is giving a talk on Machine Learning at the limit in SF Big Analytics Meetup

2015-02-10 Thread Chester Chen
Just in case you are in San Francisco, we are having a meetup featuring Prof John
Canny:

http://www.meetup.com/SF-Big-Analytics/events/220427049/


Chester


Re: Spark impersonation

2015-02-07 Thread Chester Chen
Sorry for the many typos; I was typing from my cell phone. Hope you can still
get the idea.

On Sat, Feb 7, 2015 at 1:55 PM, Chester @work ches...@alpinenow.com wrote:


  I just implemented this in our application. The impersonation is done
 before the job is submitted. In Spark on YARN (we are using yarn-cluster mode),
 it just takes the current user from UserGroupInformation and submits the job to
 the YARN resource manager.

 If one uses kinit from the command line, the whole JVM needs to have the same
 principal, and you have to handle ticket expiration with a cron job.

 If this is an individual ad hoc CLI job, this might be OK. But if you
 intend to use an application to run Spark jobs with end users interacting with
 Spark, then you need to set up a service super user, use that user to log in to
 the Kerberos KDC (the kinit equivalent) programmatically, and then create a proxy user to
 impersonate the end user. You can handle ticket expiration in code as well, so
 there is no need for a cron job.
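
 Roughly, the UserGroupInformation calls involved look like this (a sketch only,
 not our exact code; the principal, keytab path, and end-user name are placeholders):

   import java.security.PrivilegedExceptionAction
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   import org.apache.hadoop.security.UserGroupInformation

   val conf = new Configuration()
   conf.set("hadoop.security.authentication", "kerberos")
   UserGroupInformation.setConfiguration(conf)

   // Programmatic equivalent of kinit for the service super user.
   UserGroupInformation.loginUserFromKeytab(
     "svc-spark@EXAMPLE.COM", "/etc/security/keytabs/svc-spark.keytab")

   // Impersonate the end user (requires the hadoop.proxyuser.* settings on the cluster).
   val proxyUgi = UserGroupInformation.createProxyUser(
     "endUserName", UserGroupInformation.getLoginUser)

   // Run Hadoop/Spark client calls as the proxied end user.
   proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
     override def run(): Unit = {
       val fs = FileSystem.get(conf)
       fs.listStatus(new Path("/user/endUserName")).foreach(s => println(s.getPath))
     }
   })

   // Re-login from the keytab periodically in code instead of using a cron job.
   UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()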

 Certainly one could move all this logic into Spark; one would need to create a Spark
 service user principal and keytab. As part of the Spark job submit, one
 could pass the principal and keytab location to Spark, and Spark could
 create a proxy user if the authentication is Kerberos, as well as add the job
 delegation tokens.

 I would love to contribute this if we need it in Spark, as I just
 completed the Hadoop Kerberos authentication feature. It covers Pig,
 MapReduce, Spark, and Sqoop, as well as standard HDFS access.

 I will take a look at Sandy's JIRA.

 Chester

 On Feb 2, 2015, at 2:37 PM, Jim Green openkbi...@gmail.com wrote:

 Hi Team,

 Does Spark support impersonation?
 For example, when Spark runs on YARN/Hive/HBase/etc., which user is used by
 default?
 The user who starts the Spark job?
 Any suggestions related to impersonation?

 --
 Thanks,
 www.openkb.info
 (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)




Re: Possible bug in ClientBase.scala?

2014-07-13 Thread Chester Chen
Ron,
Which distribution and Version of Hadoop are you using ?

 I just looked at CDH5 (hadoop-mapreduce-client-core-2.3.0-cdh5.0.0);
MRJobConfig does have the field:

java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;

Chester



On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:

 Hi,
   I was doing programmatic submission of Spark yarn jobs and I saw code in
 ClientBase.getDefaultYarnApplicationClasspath():

 val field =
 classOf[MRJobConfig].getField(DEFAULT_YARN_APPLICATION_CLASSPATH)
 MRJobConfig doesn't have this field, so the created launch env is
 incomplete. The workaround is to set yarn.application.classpath to the value
 from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.
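
 Roughly, that workaround looks like this (a sketch, not my exact code):

   import org.apache.hadoop.yarn.conf.YarnConfiguration

   // Explicitly populate yarn.application.classpath from YarnConfiguration's
   // compiled-in default, so the launch env is complete even when the
   // reflective lookup against MRJobConfig fails.
   val yarnConf = new YarnConfiguration()
   yarnConf.setStrings(
     YarnConfiguration.YARN_APPLICATION_CLASSPATH,
     YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH: _*)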

 This results in the Spark job hanging if the submission config is
 different from the default config. For example, if my resource manager port
 is 8050 instead of 8030, then the Spark app is not able to register itself
 and stays in the ACCEPTED state.

 I can easily fix this by changing this to use YarnConfiguration instead of
 MRJobConfig, but I was wondering what the steps are for submitting a fix.

 Thanks,
 Ron

 Sent from my iPhone


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
I don't have experience deploying to EC2. Can you use an add-jars setting to add
the missing jar at runtime? I haven't tried this myself; just a guess.


On Mon, Jul 7, 2014 at 12:16 PM, Chester Chen ches...@alpinenow.com wrote:

 with provided scope, you need to supply the provided jars at runtime
 yourself; I guess in this case that means the Hadoop jar files.
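
 Roughly, the sbt side of that looks like the sketch below (the versions and the
 extra dependency are only examples, not a recommendation):

   // build.sbt sketch: mark Spark as "provided" so sbt-assembly keeps it out of
   // the uber jar, and let the spark-submit / Hadoop environment supply those
   // classes at runtime. Application libraries stay in the default compile
   // scope and get bundled into the assembly jar.
   libraryDependencies ++= Seq(
     "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
     // Example application dependency that should end up inside the assembly:
     "org.apache.httpcomponents" % "httpclient" % "4.3.3"
   )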


 On Mon, Jul 7, 2014 at 12:13 PM, Robert James srobertja...@gmail.com
 wrote:

 Thanks - that did solve my error, but I got a different one instead:
   java.lang.NoClassDefFoundError:
 org/apache/hadoop/mapreduce/lib/input/FileInputFormat

 It seems like with that setting, spark can't find Hadoop.

 On 7/7/14, Koert Kuipers ko...@tresata.com wrote:
  Spark has a setting to put user jars in front of the classpath, which
  should do the trick.
  However, I had no luck with this. See here:
 
  https://issues.apache.org/jira/browse/SPARK-1863
 
 
 
  On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com
  wrote:
 
  spark-submit includes a spark-assembly uber jar, which has older
  versions of many common libraries.  These conflict with some of the
  dependencies we need.  I have been racking my brain trying to find a
  solution (including experimenting with ProGuard), but haven't been
  able to: when we use spark-submit, we get NoSuchMethodErrors, even though
  the code compiles fine, because the runtime classes are different from
  the compile-time classes!

  Can someone recommend a solution? We are using Scala, sbt, and
  sbt-assembly, but are happy to use another tool (please provide
  instructions on how to).
 
 





Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Chester Chen
@Andrew

  Yes, the link points to the same redirected URL:


 http://localhost/proxy/application_1404443455764_0010/


  I suspect it is something to do with the cluster setup. I will let you know
once I find something.

Chester


On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or and...@databricks.com wrote:

 @Yan, the UI should still work. As long as you look into the container
 that launches the driver, you will find the SparkUI address and port. Note
 that in yarn-cluster mode the Spark driver doesn't actually run in the
 Application Manager; just like the executors, it runs in a container that
 is launched by the Resource Manager after the Application Master requests
 the container resources. In contrast, in yarn-client mode, your driver is
 not launched in a container, but in the client process that launched your
 application (i.e. spark-submit), so the stdout of this program directly
 contains the SparkUI messages.

  @Chester, I'm not sure what has gone wrong, as there are many factors at
 play here. When you go to the Resource Manager UI, does the application URL
 link point you to the same SparkUI address as indicated in the logs? If so,
 this is the correct behavior. However, I believe the redirect error has
 little to do with Spark itself and more to do with how you set up the
 cluster. I have actually run into this myself, but I haven't found a
 workaround. Let me know if you find anything.




 2014-07-07 12:07 GMT-07:00 Chester Chen ches...@alpinenow.com:

 As Andrew explained, the port is random rather than 4040, since the Spark
 driver is started in the Application Master and the port is randomly selected.


  But I have a similar UI issue. I am running yarn-cluster mode against
  my local CDH5 cluster.

  The log states:
  14/07/07 11:59:29 INFO ui.SparkUI: Started SparkUI at http://10.0.0.63:58750

 


  but when I click the Spark UI link (ApplicationMaster or
  http://10.0.0.63:58750), I get a 404 with the redirect URI



  http://localhost/proxy/application_1404443455764_0010/



 Looking at the Spark code, I notice that the proxy is really a variable that
 gets the proxy address from the yarn-site.xml HTTP settings. But even when I
 specified the value in yarn-site.xml, it still doesn't work for me.
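
 (For what it's worth, here is a small diagnostic sketch to see what proxy
 address the AM filter should redirect through; yarn.web-proxy.address is the
 standard YARN property, but treat this as a guess, not a confirmed explanation
 of the 404:)

   import org.apache.hadoop.yarn.conf.YarnConfiguration

   // Print the web proxy address; falls back to the RM web address when no
   // standalone web proxy is configured.
   val yarnConf = new YarnConfiguration()
   val proxy = Option(yarnConf.get("yarn.web-proxy.address"))
     .getOrElse(yarnConf.get(YarnConfiguration.RM_WEBAPP_ADDRESS))
   println(s"AM UI requests are proxied through: $proxy")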



 Oddly enough, it works for my co-worker on a Pivotal HD cluster, so I
 am still looking into what the difference is in terms of cluster setup or
 something else.


 Chester





 On Mon, Jul 7, 2014 at 11:42 AM, Andrew Or and...@databricks.com wrote:

 I will assume that you are running in yarn-cluster mode. Because the
 driver is launched in one of the containers, it doesn't make sense to
 expose port 4040 for the node that contains the container. (Imagine if
 multiple driver containers are launched on the same node. This will cause a
 port collision). If you're launching Spark from a gateway node that is
 physically near your worker nodes, then you can just launch your
 application in yarn-client mode, in which case the SparkUI will always be
 started on port 4040 on the node that you ran spark-submit on. The reason
 you sometimes see the red text is that it appears only in the driver
 containers, not the executor containers. This is because the SparkUI belongs to
 the SparkContext, which only exists on the driver.
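
 (If you really do want a fixed UI port despite the collision risk above, there
 is a spark.ui.port setting; a minimal sketch follows, though whether it is
 honored in yarn-cluster mode depends on the Spark version:)

   import org.apache.spark.{SparkConf, SparkContext}

   // Pin the SparkUI to a specific port instead of the default 4040.
   // Two drivers on the same node with the same setting will collide,
   // which is exactly the problem described above.
   val conf = new SparkConf()
     .setAppName("ui-port-example")
     .set("spark.ui.port", "4050")
   val sc = new SparkContext(conf)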

 Andrew


 2014-07-07 11:20 GMT-07:00 Yan Fang yanfang...@gmail.com:

 Hi guys,

 Not sure if you have similar issues; I did not find relevant tickets in
 JIRA. When I deploy Spark Streaming to YARN, I have the following two
 issues:

 1. The UI port is random; it is not the default 4040. I have to look at the
 container's log to check the UI port. Is it supposed to be this way?

 2. Most of the time, the UI does not work. The differences between the logs
 are (I ran the same program):






 *14/07/03 11:38:50 INFO spark.HttpServer: Starting HTTP Server
 14/07/03 11:38:50 INFO server.Server: jetty-8.y.z-SNAPSHOT
 14/07/03 11:38:50 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:12026
 14/07/03 11:38:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
 14/07/03 11:38:51 INFO executor.Executor: Running task ID 0 ...*

 14/07/02 16:55:32 INFO spark.HttpServer: Starting HTTP Server
 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
 14/07/02 16:55:32 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14211
 *14/07/02 16:55:32 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
 14/07/02 16:55:32 INFO server.Server: jetty-8.y.z-SNAPSHOT
 14/07/02 16:55:32 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:21867
 14/07/02 16:55:32 INFO ui.SparkUI: Started SparkUI at http://myNodeName:21867
 14/07/02 16:55:32 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler*

 When the red part (the starred lines above) appears, the UI sometimes works. Any ideas? Thank you.

 Best,

 Fang, Yan
 yanfang...@gmail.com
 +1 (206) 849

Re: Shark Vs Spark SQL

2014-07-02 Thread Chester Chen
Yes, they announced at Spark Summit 2014 that Shark is no longer under development and will be
replaced with Spark SQL.

Chester


On Wed, Jul 2, 2014 at 3:53 PM, Subacini B subac...@gmail.com wrote:

 Hi,


 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E

  This talks about the Shark backend being replaced with the Spark SQL engine in
 the future.
 Does that mean Spark will continue to support Shark + Spark SQL for the long
 term? Or
 will Shark be decommissioned after some period?

 Thanks
 Subacini



Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Chester Chen
Matei, 
If we use different Akka actors to process different users' requests (not
different threads), is the SparkContext still safe to use for different users?

Yes, it would be nice to be able to disable the UI via configuration, especially when we
develop locally. We use the sbt-web plugin to debug Tomcat code. If we could disable
the UI HTTP server, it would be much simpler than having two HTTP
containers to deal with.

Chester



On Monday, June 9, 2014 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 


You currently can’t have multiple SparkContext objects in the same JVM, but 
within a SparkContext, all of the APIs are thread-safe so you can share that 
context between multiple threads. The other issue you’ll run into is that in 
each thread where you want to use Spark, you need to use SparkEnv.set(env) 
where “env” was obtained by SparkEnv.get in the thread that created the 
context. This requirement will hopefully go away soon.
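
A minimal sketch of that pattern (the input path is just a placeholder for illustration):

  import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

  val sc  = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("shared-sc"))
  val env = SparkEnv.get  // obtained in the thread that created the context

  val worker = new Thread(new Runnable {
    override def run(): Unit = {
      SparkEnv.set(env)  // currently required in every thread that uses Spark
      // "/tmp/sample.txt" is a placeholder path.
      println(sc.textFile("/tmp/sample.txt").take(5).mkString("\n"))
    }
  })
  worker.start()
  worker.join()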

Unfortunately there’s no way yet to disable the UI — feel free to open a JIRA 
for it, it shouldn’t be hard to do.

Matei


On Jun 9, 2014, at 3:50 PM, DB Tsai dbt...@stanford.edu wrote:

 Hi guys,
 
 We would like to use the Spark Hadoop API to get the first couple hundred
 lines at design time, to quickly show users the file structure/metadata
 and the values in those lines without launching the full Spark
 job in the cluster.
 
 Since we're a web-based application, there will be multiple users using
 the Spark Hadoop API, for example, sc.textFile(filePath). I wonder if
 those APIs are thread-safe in local mode (each user will have its own
 SparkContext object).
 
 Secondly, it seems that even in local mode, the Jetty UI tracker will
 be launched. For this kind of cheap operation, having a Jetty UI tracker
 for each operation will be very expensive. Is there a way to disable
 this behavior?
 
 Thanks.
 
 Sincerely,
 
 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread Chester Chen
Matei, 
Thanks for the insight; we have to consider our design carefully. We
are in the process of moving our system to Akka, and it would be nice to use Akka
all the way. But I understand the limitations.

Thanks
Chester


On Monday, June 9, 2014 5:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 


In general you probably shouldn’t use actors for processing requests because 
Spark operations are blocking, and Akka only has a limited thread pool for each 
ActorSystem. You risk blocking all the threads with ongoing requests and not 
being able to service new ones. That said though, you can configure Akka to 
spawn more threads and in that case it would probably be okay. See 
http://doc.akka.io/docs/akka/snapshot/java/dispatchers.html for some details on 
Akka thread usage and how to configure it.
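
For example, a dedicated dispatcher for the actors that make blocking Spark calls
might look roughly like this (the dispatcher name, pool size, and the SparkJobActor
class are all illustrative, not from any existing code):

  import akka.actor.ActorSystem
  import com.typesafe.config.ConfigFactory

  // A dispatcher with its own, larger thread pool so blocking Spark calls
  // don't starve the ActorSystem's default dispatcher.
  val dispatcherConfig = ConfigFactory.parseString("""
    spark-blocking-dispatcher {
      type = Dispatcher
      executor = "thread-pool-executor"
      thread-pool-executor { fixed-pool-size = 32 }
      throughput = 1
    }
  """)
  val system = ActorSystem("app", dispatcherConfig.withFallback(ConfigFactory.load()))
  // Actors that call into Spark would then be created with, e.g.:
  //   system.actorOf(Props[SparkJobActor].withDispatcher("spark-blocking-dispatcher"))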

Matei



On Jun 9, 2014, at 4:54 PM, Chester Chen chesterxgc...@yahoo.com wrote:

Matei, 
If we use different Akka actors to process different users' requests (not
different threads), is the SparkContext still safe to use for different users?


Yes, it would be nice to be able to disable the UI via configuration, especially when we
develop locally. We use the sbt-web plugin to debug Tomcat code. If we could disable
the UI HTTP server, it would be much simpler than having two HTTP
containers to deal with.


Chester





On Monday, June 9, 2014 4:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 


You currently can’t have multiple SparkContext objects in the same JVM, but 
within a SparkContext, all of the APIs are thread-safe so you can share that 
context between multiple threads. The other issue you’ll run into is that in 
each thread where you want to use Spark, you need to use SparkEnv.set(env) 
where “env” was obtained by SparkEnv.get in the thread that created the 
context. This requirement will hopefully go away soon.

Unfortunately there’s no way yet to disable the UI — feel free to open a JIRA 
for it, it shouldn’t be hard to do.

Matei


On Jun 9, 2014, at 3:50 PM, DB Tsai dbt...@stanford.edu wrote:

 Hi guys,
 
 We would like to use the Spark Hadoop API to get the first couple hundred
 lines at design time, to quickly show users the file structure/metadata
 and the values in those lines without launching the full Spark
 job in the cluster.
 
 Since we're a web-based application, there will be multiple users using
 the Spark Hadoop API, for example, sc.textFile(filePath). I wonder if
 those APIs are thread-safe in local mode (each user will have its own
 SparkContext object).
 
 Secondly, it seems that even in local mode, the Jetty UI tracker will
 be launched. For this kind of cheap operation, having a Jetty UI tracker
 for each operation will be very expensive. Is there a way to disable
 this behavior?
 
 Thanks.
 
 Sincerely,
 
 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai




[ANN]: Scala By The Bay Conference ( aka Silicon Valley Scala Symposium)

2014-04-30 Thread Chester Chen
Hi,  
     This is not related to Spark, but I thought you might be interested: the
second SF Scala conference is coming this August. The SF Scala conference was
called the Silicon Valley Scala Symposium last year. From now on, it will be
known as Scala By The Bay.

http://www.scalabythebay.org

-- watch that space for announcements and the CFP!

Chester

Is Branch 1.0 build broken ?

2014-04-10 Thread Chester Chen
I just updated and got the following: 


[error] (external-mqtt/*:update) sbt.ResolveException: unresolved dependency: org.eclipse.paho#mqtt-client;0.4.0: not found
[error] Total time: 7 s, completed Apr 10, 2014 4:27:09 PM
Chesters-MacBook-Pro:spark chester$ git branch
* branch-1.0
  master

Looks like the resolver for the mqtt-client dependency is not specified.
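
If that's the case, adding the Eclipse Paho repository as an sbt resolver should let
it resolve; a guess at the fix (the repository URL is assumed, not verified):

  // Assumed fix: add the Eclipse Paho releases repository so
  // org.eclipse.paho#mqtt-client;0.4.0 can be resolved.
  resolvers += "Eclipse Paho Releases" at
    "https://repo.eclipse.org/content/repositories/paho-releases/"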

Chester