Re: Will Spark-SQL support vectorized query engine someday?
I don't know if there is a list, but in general running a performance profiler can identify a lot of things... On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao xuelincao2...@gmail.com wrote: Thanks, Reynold. Regarding the lower-hanging fruit, can you give me some examples? Where can I find them in JIRA? On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin r...@databricks.com wrote: It will probably eventually make its way into part of the query engine, one way or another. Note that in general there is a lot of other lower-hanging fruit before you have to do vectorization. As far as I know, Hive doesn't really have true vectorization: the vectorization in Hive simply processes everything in small batches, in order to avoid the virtual function call overhead, while hoping the JVM can unroll some of the loops. There is no SIMD involved. Something that is pretty useful, which isn't exactly vectorization but comes from similar lines of research, is being able to push predicates down into the columnar compression encoding. For example, one can turn string comparisons into integer comparisons. These will probably give much larger performance improvements in common queries. On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, correct me if I'm wrong, but it looks like the current version of Spark SQL uses a *tuple-at-a-time* model: each time, the physical operator produces a tuple by recursively calling child.execute(). There are papers that illustrate the benefits of a vectorized query engine, and Hive/Stinger also embraces this style. So, the question is, will Spark SQL support vectorized query execution someday? Thanks
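To illustrate the dictionary-encoding idea Reynold mentions, here is a minimal plain-Scala sketch (not Spark SQL's actual internals; all names are made up for illustration) of how an equality predicate on a dictionary-encoded string column reduces to integer comparisons:
```
// Sketch only: a dictionary-encoded column stores each distinct string once
// and keeps an integer code per row, so "value = 'error'" becomes an Int test.
object DictionaryPushdownSketch {
  def main(args: Array[String]): Unit = {
    val values = Array("error", "warn", "info", "error", "debug", "error")

    // Build the dictionary: each distinct string maps to an integer code.
    val dictionary: Map[String, Int] = values.distinct.zipWithIndex.toMap
    val encoded: Array[Int] = values.map(dictionary)

    // Translate the string literal into its code once...
    val targetCode: Option[Int] = dictionary.get("error")

    // ...then evaluate the predicate with cheap integer comparisons per row.
    val matchingRows = targetCode match {
      case Some(code) => encoded.indices.filter(encoded(_) == code)
      case None       => Seq.empty[Int] // literal absent from the dictionary: skip the scan
    }
    println(matchingRows) // Vector(0, 3, 5)
  }
}
```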
not found: type LocalSparkContext
Hi all, When I was trying to write a test for my Spark application I got ``` Error:(14, 43) not found: type LocalSparkContext class HyperANFSuite extends FunSuite with LocalSparkContext { ``` In the source code of spark-core I could not find LocalSparkContext, so I wonder how to write a test like [this one] ( https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala ) Alcaid
Re: Spark client reconnect to driver in yarn-cluster deployment mode
Hi Preeze, "Is there any designed way that the client connects back to the driver (still running in YARN) for collecting results at a later stage?" No, there is no support built into Spark for this. For this to happen seamlessly the driver would have to start a server (pull model) or send the results to some other server once the jobs complete (push model), both of which add complexity to the driver. Alternatively, you can just poll on the output files that your application produces; e.g. you can have your driver write the results of a count to a file and poll on that file. Something like that. -Andrew 2015-01-19 5:59 GMT-08:00 Romi Kuntsman r...@totango.com: "in yarn-client mode it only controls the environment of the executor launcher" So you either use yarn-client mode, and then your app keeps running and controlling the process, or you use yarn-cluster mode, and then you send a jar to YARN, and that jar should have code to report the result back to you. *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Jan 15, 2015 at 1:52 PM, preeze etan...@gmail.com wrote: From the official Spark documentation (http://spark.apache.org/docs/1.2.0/running-on-yarn.html): "In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application." Is there any designed way that the client connects back to the driver (still running in YARN) for collecting results at a later stage?
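To make the poll-on-output-files idea concrete, here is a rough sketch (paths, app names, and the polling interval are hypothetical, and error handling is omitted): the driver publishes a small result at a well-known HDFS location, and a separate client process polls for the _SUCCESS marker before reading it.
```
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

// Driver side (runs inside YARN in yarn-cluster mode): compute and publish a result.
object DriverWritesResult {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-and-publish"))
    val count = sc.textFile("hdfs:///data/input").count()
    sc.parallelize(Seq(count.toString), 1).saveAsTextFile("hdfs:///results/my-app/count")
    sc.stop()
  }
}

// Client side (runs anywhere with HDFS access): poll for the output to appear.
object ClientPollsResult {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs:///"), new Configuration())
    val marker = new Path("/results/my-app/count/_SUCCESS")
    while (!fs.exists(marker)) Thread.sleep(5000)
    println("Result is ready under /results/my-app/count")
  }
}
```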
Re: not found: type LocalSparkContext
It's declared here: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala I assume you're already importing LocalSparkContext, but since the test classes aren't included in Spark packages, you'll also need to package them up in order to use them in your application (viz., outside of Spark). best, wb - Original Message - From: James alcaid1...@gmail.com To: dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:35:07 AM Subject: not found: type LocalSparkContext Hi all, When I was trying to write a test on my spark application I met ``` Error:(14, 43) not found: type LocalSparkContext class HyperANFSuite extends FunSuite with LocalSparkContext { ``` At the source code of spark-core I could not found LocalSparkContext, thus I wonder how to write a test like [this] ( https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala ) Alcaid
Re: Spectral clustering
Fan and Stephen (cc'ed) are working on this feature. They will update the JIRA page and report progress soon. -Xiangrui On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Hi, thinking of picking up this Jira ticket: https://issues.apache.org/jira/browse/SPARK-4259 Anyone done any work on this to date? Any thoughts on it before we go too far in? Thanks! Best Andrew
Re: Spectral clustering
Awesome, thanks On Tue, Jan 20, 2015 at 12:56 PM, Xiangrui Meng men...@gmail.com wrote: Fan and Stephen (cc'ed) are working on this feature. They will update the JIRA page and report progress soon. -Xiangrui On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Hi, thinking of picking up this Jira ticket: https://issues.apache.org/jira/browse/SPARK-4259 Anyone done any work on this to date? Any thoughts on it before we go too far in? Thanks! Best Andrew
Standardized Spark dev environment
What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as an improvement to the new Spark developer experience. Nick
Re: Standardized Spark dev environment
Great suggestion. On Jan 20, 2015 7:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as a improvement to the new Spark developer experience. Nick
Re: Standardized Spark dev environment
How many profiles (hadoop / hive /scala) would this development environment support ? Cheers On Tue, Jan 20, 2015 at 4:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as a improvement to the new Spark developer experience. Nick
Re: Standardized Spark dev environment
How many profiles (hadoop / hive / scala) would this development environment support? As many as we want. We probably want to cover a good chunk of the build matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark officially supports. What does this provide, concretely? It provides a reliable way to create a “good” Spark development environment. Roughly speaking, this probably should mean an environment that matches Jenkins, since that’s where we run “official” testing and builds. For example, Spark has to run on Java 6 and Python 2.6. When devs build and run Spark locally, we can make sure they’re doing it on these versions of the languages with a simple vagrant up. Nate, could you comment on how something like this would relate to the Bigtop effort? http://chapeau.freevariable.com/2014/08/jvm-test-docker.html Will, that’s pretty sweet. I tried something similar a few months ago as an experiment to try building/testing Spark within a container. Here’s the shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa against the base CentOS Docker image to set up an environment ready to build and test Spark. We want to run Spark unit tests within containers on Jenkins, so it might make sense to develop a single Docker image that can be used both as a “dev environment” and as an execution container on Jenkins. Perhaps that’s the approach to take instead of looking into Vagrant. Nick On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote: Hey Nick, I did something similar with a Docker image last summer; I haven't updated the images to cache the dependencies for the current Spark master, but it would be trivial to do so: http://chapeau.freevariable.com/2014/08/jvm-test-docker.html best, wb - Original Message - From: Nicholas Chammas nicholas.cham...@gmail.com To: Spark dev list dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:13:31 PM Subject: Standardized Spark dev environment What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as an improvement to the new Spark developer experience. Nick
Re: Standardized Spark dev environment
I can comment on both... hi Will and Nate :) 1) Will's Dockerfile solution is the simplest, most direct solution to the dev environment question: it's an efficient way to build and develop Spark environments for dev/test. It would be cool to put that Dockerfile (and/or maybe a shell script which uses it) in the top level of Spark as the build entry point. For total platform portability, you could wrap it in a Vagrantfile to launch a lightweight VM, so that Windows works equally well. 2) However, since Nate mentioned Vagrant and Bigtop, I have to chime in :) The Vagrant recipes in Bigtop are a nice reference deployment of how to deploy Spark in a heterogeneous Hadoop-style environment, and tighter integration testing with Bigtop for Spark releases would be lovely! The Vagrant stuff uses Puppet to deploy an n-node VM- or Docker-based cluster, in which users can easily select components (including Spark, YARN, HBase, Hadoop, etc.) by simply editing a YAML file: https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml As Nate said, it would be a lot of fun to get more cross-collaboration between the Spark and Bigtop communities. Input on how we can better integrate Spark (whether it's Spork, HBase integration, smoke tests around the MLlib stuff, or whatever) is always welcome. On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: How many profiles (hadoop / hive /scala) would this development environment support ? As many as we want. We probably want to cover a good chunk of the build matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark officially supports. What does this provide, concretely? It provides a reliable way to create a “good” Spark development environment. Roughly speaking, this probably should mean an environment that matches Jenkins, since that’s where we run “official” testing and builds. For example, Spark has to run on Java 6 and Python 2.6. When devs build and run Spark locally, we can make sure they’re doing it on these versions of the languages with a simple vagrant up. Nate, could you comment on how something like this would relate to the Bigtop effort? http://chapeau.freevariable.com/2014/08/jvm-test-docker.html Will, that’s pretty sweet. I tried something similar a few months ago as an experiment to try building/testing Spark within a container. Here’s the shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa against the base CentOS Docker image to setup an environment ready to build and test Spark. We want to run Spark unit tests within containers on Jenkins, so it might make sense to develop a single Docker image that can be used as both a “dev environment” as well as execution container on Jenkins. Perhaps that’s the approach to take instead of looking into Vagrant. Nick On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote: Hey Nick, I did something similar with a Docker image last summer; I haven't updated the images to cache the dependencies for the current Spark master, but it would be trivial to do so: http://chapeau.freevariable.com/2014/08/jvm-test-docker.html best, wb - Original Message - From: Nicholas Chammas nicholas.cham...@gmail.com To: Spark dev list dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:13:31 PM Subject: Standardized Spark dev environment What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? 
The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as a improvement to the new Spark developer experience. Nick -- jay vyas
RE: Standardized Spark dev environment
If there is some interest in more standardization and setup of dev/test environments, the Spark community might be interested in starting to participate in the Apache Bigtop effort: http://bigtop.apache.org/ While the project had its start and initial focus on packaging, testing, and deploying the Hadoop/HDFS-related stack, it looks like we will be targeting data engineers going forward, so Spark is set to become a bigger, more central piece of the Bigtop effort as the project moves towards a v1 release. We will be doing a Bigtop/big-data workshop in late Feb at the SoCal Linux Conference: http://www.socallinuxexpo.org/scale/13x Right now we are scoping some getting-started, Spark-related content for the event, with a targeted intro of Bigtop/Spark Puppet-powered deployment components going into the event as well. Also, the group will be holding a meetup at Amazon's Palo Alto office on Jan 27th if any folks are interested. Nate -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Tuesday, January 20, 2015 5:09 PM To: Nicholas Chammas Cc: dev Subject: Re: Standardized Spark dev environment My concern would mostly be maintenance. It adds to an already very complex build. It only assists developers who are a small audience. What does this provide, concretely? On Jan 21, 2015 12:14 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as an improvement to the new Spark developer experience. Nick
Re: Standardized Spark dev environment
Hey Nick, I did something similar with a Docker image last summer; I haven't updated the images to cache the dependencies for the current Spark master, but it would be trivial to do so: http://chapeau.freevariable.com/2014/08/jvm-test-docker.html best, wb - Original Message - From: Nicholas Chammas nicholas.cham...@gmail.com To: Spark dev list dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:13:31 PM Subject: Standardized Spark dev environment What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as a improvement to the new Spark developer experience. Nick
Re: not found: type LocalSparkContext
I could not correctly import org.apache.spark.LocalSparkContext. I use sbt in IntelliJ for development; here is my build.sbt: ``` libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.0" libraryDependencies += "com.clearspring.analytics" % "stream" % "2.7.0" libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" resolvers += "Akka Repository" at "http://repo.akka.io/releases/" ``` I think maybe I have made some mistakes in the library settings. As a new developer of Spark applications, I wonder what the standard procedure of developing a Spark application is. Any reply is appreciated. Alcaid 2015-01-21 2:05 GMT+08:00 Will Benton wi...@redhat.com: It's declared here: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala I assume you're already importing LocalSparkContext, but since the test classes aren't included in Spark packages, you'll also need to package them up in order to use them in your application (viz., outside of Spark). best, wb - Original Message - From: James alcaid1...@gmail.com To: dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:35:07 AM Subject: not found: type LocalSparkContext Hi all, When I was trying to write a test on my spark application I met ``` Error:(14, 43) not found: type LocalSparkContext class HyperANFSuite extends FunSuite with LocalSparkContext { ``` At the source code of spark-core I could not found LocalSparkContext, thus I wonder how to write a test like [this] ( https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala ) Alcaid
Re: GraphX ShortestPaths backwards?
I created https://issues.apache.org/jira/browse/SPARK-5343 for this. - Original Message - From: Michael Malak michaelma...@yahoo.com To: dev@spark.apache.org dev@spark.apache.org Cc: Sent: Monday, January 19, 2015 5:09 PM Subject: GraphX ShortestPaths backwards? GraphX ShortestPaths seems to be following edges backwards instead of forwards: import org.apache.spark.graphx._ val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,"")))) lib.ShortestPaths.run(g,Array(3)).vertices.collect res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map())) lib.ShortestPaths.run(g,Array(1)).vertices.collect res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1))) If I am not mistaken about my assessment, then I believe the following changes will make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
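If that assessment is correct, one possible workaround (an untested sketch, reusing the g defined above and assuming Graph.reverse preserves the attributes ShortestPaths needs) is to run the algorithm on the reversed graph, so the backwards traversal corresponds to forward paths on the original:
```
// Untested workaround sketch: reverse all edges, then run ShortestPaths as usual.
lib.ShortestPaths.run(g.reverse, Array(3L)).vertices.collect
```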
Re: not found: type LocalSparkContext
You don't need LocalSparkContext; it is only used for Spark's own unit tests. You can just create a SparkContext and use it in your unit tests, e.g. val sc = new SparkContext("local", "my test app", new SparkConf) On Tue, Jan 20, 2015 at 7:27 PM, James alcaid1...@gmail.com wrote: I could not correctly import org.apache.spark.LocalSparkContext, I use sbt on Intellij for developing,here is my build sbt. ``` libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.0" libraryDependencies += "com.clearspring.analytics" % "stream" % "2.7.0" libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" resolvers += "Akka Repository" at "http://repo.akka.io/releases/" ``` I think maybe I have make some mistakes on the library setting, as a new developer of spark application, I wonder what is the standard procedure of developing a spark application. Any reply is appreciated. Alcaid 2015-01-21 2:05 GMT+08:00 Will Benton wi...@redhat.com: It's declared here: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala I assume you're already importing LocalSparkContext, but since the test classes aren't included in Spark packages, you'll also need to package them up in order to use them in your application (viz., outside of Spark). best, wb - Original Message - From: James alcaid1...@gmail.com To: dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:35:07 AM Subject: not found: type LocalSparkContext Hi all, When I was trying to write a test on my spark application I met ``` Error:(14, 43) not found: type LocalSparkContext class HyperANFSuite extends FunSuite with LocalSparkContext { ``` At the source code of spark-core I could not found LocalSparkContext, thus I wonder how to write a test like [this] ( https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala ) Alcaid
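For completeness, here is a minimal sketch of a self-contained suite along those lines (the class name comes from the error above; the test body and names are made up), using ScalaTest's BeforeAndAfterAll to manage the context:
```
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class HyperANFSuite extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // A local-mode SparkContext is enough for unit tests; no cluster needed.
    sc = new SparkContext("local", "HyperANFSuite", new SparkConf())
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("count a small RDD") {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}
```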
Re: Standardized Spark dev environment
To respond to the original suggestion by Nick. I always thought it would be useful to have a Docker image on which we run the tests and build releases, so that we could have a consistent environment that other packagers or people trying to exhaustively run Spark tests could replicate (or at least look at) to understand exactly how we recommend building Spark. Sean - do you think that is too high an overhead? In terms of providing images that we encourage as standard deployment images of Spark and want to make portable across environments, that's a much larger project and one with higher associated maintenance overhead. So I'd be interested in seeing that evolve as its own project (spark-deploy) or something associated with bigtop, etc. - Patrick On Tue, Jan 20, 2015 at 10:30 PM, Paolo Platter paolo.plat...@agilelab.it wrote: Hi all, I also tried the docker way and it works well. I suggest to look at sequenceiq/spark dockers, they are very active on that field. Paolo Sent from my Windows Phone From: jay vyas <jayunit100.apa...@gmail.com> Sent: 21/01/2015 04:45 To: Nicholas Chammas <nicholas.cham...@gmail.com> Cc: Will Benton <wi...@redhat.com>; Spark dev list <dev@spark.apache.org> Subject: Re: Standardized Spark dev environment I can comment on both... hi will and nate :) 1) Will's Dockerfile solution is the most simple direct solution to the dev environment question : its a efficient way to build and develop spark environments for dev/test.. It would be cool to put that Dockerfile (and/or maybe a shell script which uses it) in the top level of spark as the build entry point. For total platform portability, u could wrap in a vagrantfile to launch a lightweight vm, so that windows worked equally well. 2) However, since nate mentioned vagrant and bigtop, i have to chime in :) the vagrant recipes in bigtop are a nice reference deployment of how to deploy spark in a heterogenous hadoop style environment, and tighter integration testing w/ bigtop for spark releases would be lovely ! The vagrant stuff use puppet to deploy an n node VM or docker based cluster, in which users can easily select components (including spark,yarn,hbase,hadoop,etc...) by simnply editing a YAML file : https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml As nate said, it would be alot of fun to get more cross collaboration between the spark and bigtop communities. Input on how we can better integrate spark (wether its spork, hbase integration, smoke tests aroudn the mllib stuff, or whatever, is always welcome ) On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: How many profiles (hadoop / hive /scala) would this development environment support ? As many as we want. We probably want to cover a good chunk of the build matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark officially supports. What does this provide, concretely? It provides a reliable way to create a good Spark development environment. Roughly speaking, this probably should mean an environment that matches Jenkins, since that's where we run official testing and builds. For example, Spark has to run on Java 6 and Python 2.6. When devs build and run Spark locally, we can make sure they're doing it on these versions of the languages with a simple vagrant up. Nate, could you comment on how something like this would relate to the Bigtop effort? http://chapeau.freevariable.com/2014/08/jvm-test-docker.html Will, that's pretty sweet. 
I tried something similar a few months ago as an experiment to try building/testing Spark within a container. Here's the shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa against the base CentOS Docker image to setup an environment ready to build and test Spark. We want to run Spark unit tests within containers on Jenkins, so it might make sense to develop a single Docker image that can be used as both a dev environment as well as execution container on Jenkins. Perhaps that's the approach to take instead of looking into Vagrant. Nick On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote: Hey Nick, I did something similar with a Docker image last summer; I haven't updated the images to cache the dependencies for the current Spark master, but it would be trivial to do so: http://chapeau.freevariable.com/2014/08/jvm-test-docker.html best, wb - Original Message - From: Nicholas Chammas nicholas.cham...@gmail.com To: Spark dev list dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:13:31 PM Subject: Standardized Spark dev environment What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it
KNN for large data set
Hi all, Please help me find the best way to do K-nearest neighbors using Spark for large data sets.
Re: Standardized Spark dev environment
Hi all, I also tried the Docker way and it works well. I suggest looking at the sequenceiq/spark Docker images; they are very active in that field. Paolo Sent from my Windows Phone From: jay vyas <jayunit100.apa...@gmail.com> Sent: 21/01/2015 04:45 To: Nicholas Chammas <nicholas.cham...@gmail.com> Cc: Will Benton <wi...@redhat.com>; Spark dev list <dev@spark.apache.org> Subject: Re: Standardized Spark dev environment I can comment on both... hi will and nate :) 1) Will's Dockerfile solution is the most simple direct solution to the dev environment question : its a efficient way to build and develop spark environments for dev/test.. It would be cool to put that Dockerfile (and/or maybe a shell script which uses it) in the top level of spark as the build entry point. For total platform portability, u could wrap in a vagrantfile to launch a lightweight vm, so that windows worked equally well. 2) However, since nate mentioned vagrant and bigtop, i have to chime in :) the vagrant recipes in bigtop are a nice reference deployment of how to deploy spark in a heterogenous hadoop style environment, and tighter integration testing w/ bigtop for spark releases would be lovely ! The vagrant stuff use puppet to deploy an n node VM or docker based cluster, in which users can easily select components (including spark,yarn,hbase,hadoop,etc...) by simnply editing a YAML file : https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml As nate said, it would be alot of fun to get more cross collaboration between the spark and bigtop communities. Input on how we can better integrate spark (wether its spork, hbase integration, smoke tests aroudn the mllib stuff, or whatever, is always welcome ) On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: How many profiles (hadoop / hive /scala) would this development environment support ? As many as we want. We probably want to cover a good chunk of the build matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark officially supports. What does this provide, concretely? It provides a reliable way to create a “good” Spark development environment. Roughly speaking, this probably should mean an environment that matches Jenkins, since that’s where we run “official” testing and builds. For example, Spark has to run on Java 6 and Python 2.6. When devs build and run Spark locally, we can make sure they’re doing it on these versions of the languages with a simple vagrant up. Nate, could you comment on how something like this would relate to the Bigtop effort? http://chapeau.freevariable.com/2014/08/jvm-test-docker.html Will, that’s pretty sweet. I tried something similar a few months ago as an experiment to try building/testing Spark within a container. Here’s the shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa against the base CentOS Docker image to setup an environment ready to build and test Spark. We want to run Spark unit tests within containers on Jenkins, so it might make sense to develop a single Docker image that can be used as both a “dev environment” as well as execution container on Jenkins. Perhaps that’s the approach to take instead of looking into Vagrant. 
Nick On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote: Hey Nick, I did something similar with a Docker image last summer; I haven't updated the images to cache the dependencies for the current Spark master, but it would be trivial to do so: http://chapeau.freevariable.com/2014/08/jvm-test-docker.html best, wb - Original Message - From: Nicholas Chammas nicholas.cham...@gmail.com To: Spark dev list dev@spark.apache.org Sent: Tuesday, January 20, 2015 6:13:31 PM Subject: Standardized Spark dev environment What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`? The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work. I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as a improvement to the new Spark developer experience. Nick -- jay vyas
Re: Is there any way to support multiple users executing SQL on thrift server?
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but I would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I think there are some problems in Spark 1.2.0. For example:
1. Start the thrift server as user A.
2. Connect to the thrift server via beeline as user B.
3. Execute “insert into table dest select … from table src”.
Then we find these items on HDFS:
drwxr-xr-x - B supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
drwxr-xr-x - B supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
drwxr-xr-x - B supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
drwxr-xr-x - A supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
drwxr-xr-x - A supergroup 0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
-rw-r--r-- 3 A supergroup 2671 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0
You can see that all the temporary paths created on the driver side (thrift server side) are owned by user B (which is what we expected), but all the output data created on the executor side is owned by user A (which is NOT what we expected). The wrong owner of the output data causes an org.apache.hadoop.security.AccessControlException when the driver side moves the output data into the dest table. Does anyone know how to resolve this problem?
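Not what the thrift server does today, but for reference, here is a minimal sketch of the Hadoop proxy-user pattern that could make such writes show up as the connected user (user names and paths are hypothetical, and it assumes HDFS is configured to let the service user impersonate others via the hadoop.proxyuser.* settings):
```
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object WriteAsProxyUser {
  def main(args: Array[String]): Unit = {
    val connectedUser = "B" // hypothetical: the user authenticated via beeline
    // Impersonate the connected user on top of the service user running this JVM.
    val ugi = UserGroupInformation.createProxyUser(connectedUser, UserGroupInformation.getCurrentUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(new Configuration())
        // Files created inside doAs are owned by the proxy user, not the service user.
        fs.create(new Path("/tmp/owned-by-connected-user")).close()
      }
    })
  }
}
```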