Getting the execution times of spark job

2014-09-02 Thread Niranda Perera
Hi, I have been playing around with spark for a couple of days. I am using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the implementation is to run Hive queries on Spark. I used JavaHiveContext to achieve this (As per the examples). I have 2 questions. 1. I am wondering how I

Re: Getting the execution times of spark job

2014-09-02 Thread Zongheng Yang
For your second question: hql() (as well as sql()) does not launch a Spark job immediately; instead, it fires off the Spark SQL parser/optimizer/planner pipeline first, and a Spark job will be started after a physical execution plan is selected. Therefore, your hand-rolled end-to-end
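To illustrate the point about lazy execution, here is a minimal Python sketch that times planning and execution separately. It does not use Spark: `LazyQuery`, `plan_query`, and `timed` are hypothetical stand-ins invented for this example, with `plan_query` playing the role of hql()/sql() (it returns a handle without running anything) and `collect()` playing the role of an action that triggers the actual work.

```python
import time


class LazyQuery:
    """Hypothetical stand-in for a lazily evaluated query result (not a Spark class)."""

    def __init__(self, rows):
        self._rows = rows
        self.executed = False

    def collect(self):
        # The expensive work happens only here, when an action is called.
        self.executed = True
        return [r * 2 for r in self._rows]


def plan_query(rows):
    # Stands in for parsing/optimizing/planning: returns a handle, runs nothing.
    return LazyQuery(rows)


def timed(fn, *args):
    # Return (result, elapsed seconds) measured with a monotonic clock.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


query, plan_seconds = timed(plan_query, range(1000))
rows, run_seconds = timed(query.collect)
print(f"planning: {plan_seconds:.6f}s, execution: {run_seconds:.6f}s")
```

Timing only the first call would measure planning alone, and timing both calls together would fold planning into the job time, so the sketch keeps the two measurements apart.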

Re: Spark SQL Query and join different data sources.

2014-09-02 Thread Yin Huai
Actually, with HiveContext, you can join hive tables with registered temporary tables. On Fri, Aug 22, 2014 at 9:07 PM, chutium teng@gmail.com wrote: oops, thanks Yan, you are right, i got scala> sqlContext.sql("select * from a join b").take(10) java.lang.RuntimeException: Table Not Found:

about spark assembly jar

2014-09-02 Thread scwf
Hi all, I suggest that Spark not use the assembly jar as its default run-time dependency (spark-submit/spark-class depend on the assembly jar); using a library of all the third-party dependency jars, such as hadoop/hive/hbase, seems more reasonable. 1. The assembly jar packages all third-party jars into one big jar, so we need to rebuild this jar if

Re: about spark assembly jar

2014-09-02 Thread Sean Owen
Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not remove version conflicts, just pushes them to run-time, which isn't good. The assembly is also necessary because that's where shading happens. In development, you want to run

Re: about spark assembly jar

2014-09-02 Thread scwf
Yes. I am not sure what happens when building the assembly jar; in my understanding it just packages all the dependency jars into one big jar. On 2014/9/2 16:45, Sean Owen wrote: Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not

Re: about spark assembly jar

2014-09-02 Thread Ye Xianjin
Sorry, the quick reply didn't cc the dev list. Sean, sometimes I have to use the spark-shell to confirm some behavior change. In that case, I have to reassemble the whole project. Is there another way around that does not use the big jar in development? For the original question, I have no

Re: about spark assembly jar

2014-09-02 Thread scwf
Hi Sean Owen, here are some problems I hit when using the assembly jar. 1. I put spark-assembly-*.jar in the lib directory of my application, and it throws a compile error: Error:scalac: Error: class scala.reflect.BeanInfo not found. scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Will Benton
Zongheng pointed out in my SPARK-3329 PR (https://github.com/apache/spark/pull/2220) that Aaron had already fixed this issue but that it had gotten inadvertently clobbered by another patch. I don't know how the project handles this kind of problem, but I've rewritten my SPARK-3329 branch to

Re: about spark assembly jar

2014-09-02 Thread Sandy Ryza
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging. -Sandy On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote: Hi sean owen, here are some problems when i used

hive client.getAllPartitions in lookupRelation can take a very long time

2014-09-02 Thread chutium
in our hive warehouse there are many tables with a lot of partitions, such as scala> hiveContext.sql("use db_external") scala> val result = hiveContext.sql("show partitions et_fullorders").count result: Long = 5879 i noticed that this part of code:

hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread shane knapp
so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Reynold Xin
Welcome, Shane! On Tuesday, September 2, 2014, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Nicholas Chammas
Hi Shane! Thank you for doing the Jenkins upgrade last week. It's nice to know that infrastructure is gonna get some dedicated TLC going forward. Welcome aboard! Nick On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the databricks guys on friday

Re: about spark assembly jar

2014-09-02 Thread Reynold Xin
Having an SSD helps tremendously with assembly time. Without that, you can do the following in order for Spark to pick up the compiled classes before assembly at runtime. export SPARK_PREPEND_CLASSES=true On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: This doesn't

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Patrick Wendell
Hey Shane, Thanks for your work so far and I'm really happy to see investment in this infrastructure. This is a key productivity tool for us and something we'd love to expand over time to improve the development process of Spark. - Patrick On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Christopher Nguyen
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting this to be a first-class engineering college subject. I just didn't expect it to come through this route :-) So congrats, and I hope you represent the beginning of a great new trend at universities. Sent while mobile.

Resource allocation

2014-09-02 Thread rapelly kartheek
Hi, I want to incorporate some intelligence while choosing the resources for rdd replication. I thought, if we replicate rdd on specially chosen nodes based on the capabilities, the next application that requires this rdd can be executed more efficiently. But, I found that an rdd created by an

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all this useful black magic. On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote: Having an SSD helps tremendously with assembly time. Without that, you can

Re: about spark assembly jar

2014-09-02 Thread Josh Rosen
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote: Yea, SSD + SPARK_PREPEND_CLASSES totally

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Cool, didn't notice that, thanks Josh! On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen rosenvi...@gmail.com wrote: SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Henry Saputra
Welcome Shane =) - Henry On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Cheng Lian
Welcome Shane! Glad to finally see a hero jumping out to tame Jenkins :) On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra henry.sapu...@gmail.com wrote: Welcome Shane =) - Henry On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote: so, i had a meeting w/the

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Will Benton
+1 Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK 8). best, wb - Original Message - From: Patrick Wendell pwend...@gmail.com To: dev@spark.apache.org Sent: Saturday, August 30, 2014 5:07:52 PM Subject: [VOTE] Release Apache Spark 1.1.0 (RC3) Please

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Cheng Lian
+1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined. On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote: +1 Tested Scala/MLlib apps on

Checkpointing Pregel

2014-09-02 Thread Jeffrey Picard
Hey guys, I’m trying to run connected components on graphs that end up running for a fairly large number of iterations (25-30) and take 5-6 hours. I find more than half the time I end up getting fetch failures and losing an executor after a number of iterations. Then it has to go back and

Ask something about spark

2014-09-02 Thread Sanghoon Lee
Hi, I am phoenixlee, a Spark programmer in Korea. This time I have a good chance to teach college students and office workers Spark. This course will be done with the support of the government. Can I use the data (pictures, samples, etc.) on the Spark homepage for this course? Of

Re: Ask something about spark

2014-09-02 Thread Reynold Xin
I think in general that is fine. It would be great if your slides come with proper attribution. On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl...@gmail.com wrote: Hi, I am phoenixlee, a Spark programmer in Korea. This time I have a good chance to teach college students

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Reynold Xin
+1 On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com wrote: +1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined. On Tue, Sep 2, 2014 at

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Kan Zhang
+1 Verified PySpark InputFormat/OutputFormat examples. On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin r...@databricks.com wrote: +1 On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com wrote: +1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Michael Armbrust
+1 On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac OS X. Matei On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote: +1 Verified PySpark InputFormat/OutputFormat examples. On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Denny Lee
+1 Tested on Mac OSX, Thrift Server, SparkSQL On September 2, 2014 at 17:29:29, Michael Armbrust (mich...@databricks.com) wrote: +1 On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac OS X. Matei On September 2, 2014 at

RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Sean McNamara
+1 From: Patrick Wendell [pwend...@gmail.com] Sent: Saturday, August 30, 2014 4:08 PM To: dev@spark.apache.org Subject: [VOTE] Release Apache Spark 1.1.0 (RC3) Please vote on releasing the following candidate as Apache Spark version 1.1.0! The tag to be

RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Jeremy Freeman
+1 -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: quick jenkins restart

2014-09-02 Thread shane knapp
and we're back and building! On Tue, Sep 2, 2014 at 5:07 PM, shane knapp skn...@berkeley.edu wrote: since our queue is really short, i'm waiting for a couple of builds to finish and will be restarting jenkins to install/update some plugins. the github pull request builder looks like it has

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Paolo Platter
+1 Tested on HDP 2.1 Sandbox, Thrift Server with Simba Shark ODBC Paolo From: Jeremy Freeman (freeman.jer...@gmail.com) Sent: Wednesday, 3 September 2014 02:34 To: d...@spark.incubator.apache.org +1 -- View this message in context:

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Nicholas Chammas
In light of the discussion on SPARK-, I'll revoke my -1 vote. The issue does not appear to be serious. On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: -1: I believe I've found a regression from 1.0.2. The report is captured in SPARK-

Re: about spark assembly jar

2014-09-02 Thread scwf
Yea, SSD + SPARK_PREPEND_CLASSES is great for iterative development! But why does it work with a bag of third-party jars yet throw an error with the assembly jar? Does anyone have an idea? On 2014/9/3 2:57, Cheng Lian wrote: Cool, didn't notice that, thanks Josh! On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Patrick Wendell
Thanks everyone for voting on this. Two minor issues (one a blocker) were found that warrant cutting a new RC. For those who voted +1 on this release, I'd encourage you to +1 rc4 when it comes out unless you have been testing issues specific to the EC2 scripts. This will move the

Re: [Spark SQL] off-heap columnar store

2014-09-02 Thread Evan Chan
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell i...@ianoconnell.com wrote: I'm not sure what you mean here? Parquet is at its core just a format, you could store that data anywhere. Though it sounds like you're saying, correct me if I'm wrong: you basically want a columnar abstraction layer