Re: Will Spark-SQL support vectorized query engine someday?

2015-01-20 Thread Reynold Xin
I don't know if there is a list, but in general running a performance
profiler can identify a lot of things...

On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao xuelincao2...@gmail.com
wrote:


 Thanks, Reynold

   Regarding the lower-hanging fruit, can you give me some examples?
 Where can I find them in JIRA?


 On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin r...@databricks.com wrote:

 It will probably eventually make its way into part of the query engine,
 one way or another. Note that there is in general a lot of other lower-hanging
 fruit before you have to do vectorization.

 As far as I know, Hive doesn't really have true vectorization: what Hive
 calls vectorization is simply writing everything in small batches, in
 order to avoid the virtual function call overhead, and hoping the JVM can
 unroll some of the loops. There is no SIMD involved.

 Something that is pretty useful, which isn't exactly from vectorization
 but comes from similar lines of research, is being able to push predicates
 down into the columnar compression encoding. For example, one can turn
 string comparisons into integer comparisons. These will probably give much
 larger performance improvements in common queries.
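
 (Illustrative sketch only, not Spark internals: assuming a dictionary-encoded
 string column, an equality predicate can be evaluated directly on the integer
 codes. The type and function names below are hypothetical.)

```
// Hypothetical sketch: evaluate an equality predicate on a dictionary-encoded
// column so the per-row work is an Int comparison instead of a String comparison.
case class DictEncodedColumn(dictionary: Array[String], codes: Array[Int])

def equalsPredicate(col: DictEncodedColumn, value: String): Array[Boolean] = {
  val code = col.dictionary.indexOf(value)        // resolve the string once
  if (code < 0) Array.fill(col.codes.length)(false)
  else col.codes.map(_ == code)                   // integer comparison per row
}

// Example:
// val col = DictEncodedColumn(Array("US", "CN", "IN"), Array(0, 2, 1, 0))
// equalsPredicate(col, "IN")  // Array(false, true, false, false)
```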


 On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com
 wrote:

 Hi,

  Correct me if I'm wrong. It looks like the current version of
 Spark-SQL is a *tuple-at-a-time* engine: each physical operator produces one
 tuple at a time by recursively calling child.execute().
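
 (For illustration only, not from the original message and not Spark's actual
 operator interfaces; a rough sketch of the tuple-at-a-time vs. vectorized
 contrast, with hypothetical names:)

```
// Volcano-style: each operator pulls one row at a time from its child.
trait RowIterator[T] { def next(): Option[T] }

class FilterOp(child: RowIterator[Int], p: Int => Boolean) extends RowIterator[Int] {
  def next(): Option[Int] = {
    var row = child.next()                          // one (virtual) call per row
    while (row.exists(r => !p(r))) row = child.next()
    row
  }
}

// Vectorized style: operators exchange batches of column values, amortizing
// call overhead and allowing tight loops over arrays.
case class ColumnBatch(values: Array[Int])

def filterBatch(batch: ColumnBatch, p: Int => Boolean): ColumnBatch =
  ColumnBatch(batch.values.filter(p))               // one call per batch
```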

  There are papers that illustrate the benefits of a vectorized query
 engine, and Hive's Stinger initiative also embraces this style.

  So, the question is: will Spark-SQL support vectorized query
 execution someday?

  Thanks






not found: type LocalSparkContext

2015-01-20 Thread James
Hi all,

When I was trying to write a test for my Spark application I got:

```
Error:(14, 43) not found: type LocalSparkContext
class HyperANFSuite extends FunSuite with LocalSparkContext {
```

In the source code of spark-core I could not find LocalSparkContext,
so I wonder how to write a test like [this](
https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
)

Alcaid


Re: Spark client reconnect to driver in yarn-cluster deployment mode

2015-01-20 Thread Andrew Or
Hi Preeze,

 Is there any designed way that the client connects back to the driver
(still running in YARN) for collecting results at a later stage?

No, there is no support built into Spark for this. For this to happen
seamlessly the driver will have to start a server (pull model) or send the
results to some other server once the jobs complete (push model), both of
which add complexity to the driver. Alternatively, you can just poll on the
output files that your application produces; e.g. you can have your driver
write the results of a count to a file and poll on that file. Something
like that.
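
(A rough sketch of that polling idea, not from the original message; the output
path and helper names are made up, and it assumes both sides agree on an HDFS
location:)

```
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Driver side (yarn-cluster): publish the result where the client can find it.
def publishCount(count: Long, pathStr: String = "/results/my-app/count"): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(pathStr), true)      // overwrite if present
  out.write(count.toString.getBytes(StandardCharsets.UTF_8))
  out.close()
}

// Client side: poll until the file appears, then read it.
def pollForCount(pathStr: String = "/results/my-app/count"): Long = {
  val fs = FileSystem.get(new Configuration())
  val path = new Path(pathStr)
  while (!fs.exists(path)) Thread.sleep(5000)       // poll every 5 seconds
  val in = fs.open(path)
  val result = scala.io.Source.fromInputStream(in).mkString.trim.toLong
  in.close()
  result
}
```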

-Andrew

2015-01-19 5:59 GMT-08:00 Romi Kuntsman r...@totango.com:

 in yarn-client mode it only controls the environment of the executor
 launcher

 So you either use yarn-client mode, and then your app keeps running and
 controlling the process, or you use yarn-cluster mode, and then you send a
 jar to YARN, and that jar should have code to report the result back to you.

 *Romi Kuntsman*, *Big Data Engineer*
  http://www.totango.com

 On Thu, Jan 15, 2015 at 1:52 PM, preeze etan...@gmail.com wrote:

  From the official spark documentation
  (http://spark.apache.org/docs/1.2.0/running-on-yarn.html):
 
  In yarn-cluster mode, the Spark driver runs inside an application master
  process which is managed by YARN on the cluster, and the client can go
 away
  after initiating the application.
 
  Is there any designed way that the client connects back to the driver
  (still
  running in YARN) for collecting results at a later stage?
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-client-reconnect-to-driver-in-yarn-cluster-deployment-mode-tp10122.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: not found: type LocalSparkContext

2015-01-20 Thread Will Benton
It's declared here:

  
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala

I assume you're already importing LocalSparkContext, but since the test classes 
aren't included in Spark packages, you'll also need to package them up in order 
to use them in your application (viz., outside of Spark).
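
(If you'd rather not package Spark's test classes, a minimal stand-in is easy to
write yourself. This is only a hedged sketch, not the actual Spark trait:)

```
// A minimal home-grown equivalent of Spark's LocalSparkContext test trait:
// gives each test a fresh local SparkContext and stops it afterwards.
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, Suite}

trait MyLocalSparkContext extends BeforeAndAfterEach { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeEach(): Unit = {
    super.beforeEach()
    sc = new SparkContext(new SparkConf().setMaster("local").setAppName(suiteName))
  }

  override def afterEach(): Unit = {
    if (sc != null) { sc.stop(); sc = null }
    System.clearProperty("spark.driver.port")       // avoid port clashes between suites
    super.afterEach()
  }
}
```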



best,
wb

- Original Message -
 From: James alcaid1...@gmail.com
 To: dev@spark.apache.org
 Sent: Tuesday, January 20, 2015 6:35:07 AM
 Subject: not found: type LocalSparkContext
 
 Hi all,
 
 When I was trying to write a test on my spark application I met
 
 ```
 Error:(14, 43) not found: type LocalSparkContext
 class HyperANFSuite extends FunSuite with LocalSparkContext {
 ```
 
 At the source code of spark-core I could not found LocalSparkContext,
 thus I wonder how to write a test like [this] (
 https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
 )
 
 Alcaid
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spectral clustering

2015-01-20 Thread Xiangrui Meng
Fan and Stephen (cc'ed) are working on this feature. They will update
the JIRA page and report progress soon. -Xiangrui

On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman
andrew.mussel...@gmail.com wrote:
 Hi, thinking of picking up this Jira ticket:
 https://issues.apache.org/jira/browse/SPARK-4259

 Anyone done any work on this to date?  Any thoughts on it before we go too
 far in?

 Thanks!

 Best
 Andrew

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spectral clustering

2015-01-20 Thread Andrew Musselman
Awesome, thanks

On Tue, Jan 20, 2015 at 12:56 PM, Xiangrui Meng men...@gmail.com wrote:

 Fan and Stephen (cc'ed) are working on this feature. They will update
 the JIRA page and report progress soon. -Xiangrui

 On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:
  Hi, thinking of picking up this Jira ticket:
  https://issues.apache.org/jira/browse/SPARK-4259
 
  Anyone done any work on this to date?  Any thoughts on it before we go
 too
  far in?
 
  Thanks!
 
  Best
  Andrew



Standardized Spark dev environment

2015-01-20 Thread Nicholas Chammas
What do y'all think of creating a standardized Spark development
environment, perhaps encoded as a Vagrantfile, and publishing it under
`dev/`?

The goal would be to make it easier for new developers to get started with
all the right configs and tools pre-installed.

If we use something like Vagrant, we may even be able to make it so that a
single Vagrantfile creates equivalent development environments across OS X,
Linux, and Windows, without having to do much (or any) OS-specific work.

I imagine for committers and regular contributors, this exercise may seem
pointless, since y'all are probably already very comfortable with your
workflow.

I wonder, though, if any of you think this would be worthwhile as an
improvement to the new Spark developer experience.

Nick


Re: Standardized Spark dev environment

2015-01-20 Thread shenyan zhen
Great suggestion.
On Jan 20, 2015 7:14 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 What do y'all think of creating a standardized Spark development
 environment, perhaps encoded as a Vagrantfile, and publishing it under
 `dev/`?

 The goal would be to make it easier for new developers to get started with
 all the right configs and tools pre-installed.

 If we use something like Vagrant, we may even be able to make it so that a
 single Vagrantfile creates equivalent development environments across OS X,
 Linux, and Windows, without having to do much (or any) OS-specific work.

 I imagine for committers and regular contributors, this exercise may seem
 pointless, since y'all are probably already very comfortable with your
 workflow.

 I wonder, though, if any of you think this would be worthwhile as a
 improvement to the new Spark developer experience.

 Nick



Re: Standardized Spark dev environment

2015-01-20 Thread Ted Yu
How many profiles (hadoop / hive /scala) would this development environment
support ?

Cheers

On Tue, Jan 20, 2015 at 4:13 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 What do y'all think of creating a standardized Spark development
 environment, perhaps encoded as a Vagrantfile, and publishing it under
 `dev/`?

 The goal would be to make it easier for new developers to get started with
 all the right configs and tools pre-installed.

 If we use something like Vagrant, we may even be able to make it so that a
 single Vagrantfile creates equivalent development environments across OS X,
 Linux, and Windows, without having to do much (or any) OS-specific work.

 I imagine for committers and regular contributors, this exercise may seem
 pointless, since y'all are probably already very comfortable with your
 workflow.

 I wonder, though, if any of you think this would be worthwhile as a
 improvement to the new Spark developer experience.

 Nick



Re: Standardized Spark dev environment

2015-01-20 Thread Nicholas Chammas
How many profiles (hadoop / hive /scala) would this development environment
support ?

As many as we want. We probably want to cover a good chunk of the build
matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark
officially supports.

What does this provide, concretely?

It provides a reliable way to create a “good” Spark development
environment. Roughly speaking, this probably should mean an environment
that matches Jenkins, since that’s where we run “official” testing and
builds.

For example, Spark has to run on Java 6 and Python 2.6. When devs build and
run Spark locally, we can make sure they’re doing it on these versions of
the languages with a simple vagrant up.

Nate, could you comment on how something like this would relate to the
Bigtop effort?

http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

Will, that’s pretty sweet. I tried something similar a few months ago as an
experiment to try building/testing Spark within a container. Here’s the
shell script I used (https://gist.github.com/nchammas/60b04141f3b9f053faaa)
against the base CentOS Docker image to set up an environment ready to build
and test Spark.

We want to run Spark unit tests within containers on Jenkins, so it might
make sense to develop a single Docker image that can be used as both a “dev
environment” and an execution container on Jenkins.

Perhaps that’s the approach to take instead of looking into Vagrant.

Nick

On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote:

Hey Nick,

 I did something similar with a Docker image last summer; I haven't updated
 the images to cache the dependencies for the current Spark master, but it
 would be trivial to do so:

 http://chapeau.freevariable.com/2014/08/jvm-test-docker.html


 best,
 wb


 - Original Message -
  From: Nicholas Chammas nicholas.cham...@gmail.com
  To: Spark dev list dev@spark.apache.org
  Sent: Tuesday, January 20, 2015 6:13:31 PM
  Subject: Standardized Spark dev environment
 
  What do y'all think of creating a standardized Spark development
  environment, perhaps encoded as a Vagrantfile, and publishing it under
  `dev/`?
 
  The goal would be to make it easier for new developers to get started
 with
  all the right configs and tools pre-installed.
 
  If we use something like Vagrant, we may even be able to make it so that
 a
  single Vagrantfile creates equivalent development environments across OS
 X,
  Linux, and Windows, without having to do much (or any) OS-specific work.
 
  I imagine for committers and regular contributors, this exercise may seem
  pointless, since y'all are probably already very comfortable with your
  workflow.
 
  I wonder, though, if any of you think this would be worthwhile as a
  improvement to the new Spark developer experience.
 
  Nick
 

​


Re: Standardized Spark dev environment

2015-01-20 Thread jay vyas
I can comment on both... hi Will and Nate :)

1) Will's Dockerfile solution is the simplest, most direct solution to the
dev environment question: it's an efficient way to build and develop Spark
environments for dev/test. It would be cool to put that Dockerfile
(and/or maybe a shell script which uses it) in the top level of Spark as
the build entry point. For total platform portability, you could wrap it in a
Vagrantfile to launch a lightweight VM, so that Windows worked equally
well.

2) However, since Nate mentioned Vagrant and Bigtop, I have to chime in :)
The Vagrant recipes in Bigtop are a nice reference deployment of how to
deploy Spark in a heterogeneous Hadoop-style environment, and tighter
integration testing with Bigtop for Spark releases would be lovely! The
Vagrant stuff uses Puppet to deploy an n-node VM or Docker-based cluster, in
which users can easily select components (including
spark, yarn, hbase, hadoop, etc.) by simply editing a YAML file:
https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
As Nate said, it would be a lot of fun to get more cross-collaboration
between the Spark and Bigtop communities. Input on how we can better
integrate Spark (whether it's Spork, HBase integration, smoke tests around
the MLlib stuff, or whatever) is always welcome.






On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 How many profiles (hadoop / hive /scala) would this development environment
 support ?

 As many as we want. We probably want to cover a good chunk of the build
 matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark
 officially supports.

 What does this provide, concretely?

 It provides a reliable way to create a “good” Spark development
 environment. Roughly speaking, this probably should mean an environment
 that matches Jenkins, since that’s where we run “official” testing and
 builds.

 For example, Spark has to run on Java 6 and Python 2.6. When devs build and
 run Spark locally, we can make sure they’re doing it on these versions of
 the languages with a simple vagrant up.

 Nate, could you comment on how something like this would relate to the
 Bigtop effort?

 http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

 Will, that’s pretty sweet. I tried something similar a few months ago as an
 experiment to try building/testing Spark within a container. Here’s the
 shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa
 
 against the base CentOS Docker image to setup an environment ready to build
 and test Spark.

 We want to run Spark unit tests within containers on Jenkins, so it might
 make sense to develop a single Docker image that can be used as both a “dev
 environment” as well as execution container on Jenkins.

 Perhaps that’s the approach to take instead of looking into Vagrant.

 Nick

 On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote:

 Hey Nick,
 
  I did something similar with a Docker image last summer; I haven't
 updated
  the images to cache the dependencies for the current Spark master, but it
  would be trivial to do so:
 
  http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
 
 
  best,
  wb
 
 
  - Original Message -
   From: Nicholas Chammas nicholas.cham...@gmail.com
   To: Spark dev list dev@spark.apache.org
   Sent: Tuesday, January 20, 2015 6:13:31 PM
   Subject: Standardized Spark dev environment
  
   What do y'all think of creating a standardized Spark development
   environment, perhaps encoded as a Vagrantfile, and publishing it under
   `dev/`?
  
   The goal would be to make it easier for new developers to get started
  with
   all the right configs and tools pre-installed.
  
   If we use something like Vagrant, we may even be able to make it so
 that
  a
   single Vagrantfile creates equivalent development environments across
 OS
  X,
   Linux, and Windows, without having to do much (or any) OS-specific
 work.
  
   I imagine for committers and regular contributors, this exercise may
 seem
   pointless, since y'all are probably already very comfortable with your
   workflow.
  
   I wonder, though, if any of you think this would be worthwhile as a
   improvement to the new Spark developer experience.
  
   Nick
  
 
 ​




-- 
jay vyas


RE: Standardized Spark dev environment

2015-01-20 Thread nate
If there is some interest in more standardization and setup of dev/test
environments, the Spark community might be interested in starting to participate
in the Apache Bigtop effort:

http://bigtop.apache.org/

While the project had its start and initial focus on packaging, testing, and
deploying the Hadoop/HDFS-related stack, it looks like we will be targeting data
engineers going forward, so Spark is set to become a bigger, more central piece
of the Bigtop effort as the project moves towards a v1 release.

We will be doing a Bigtop/big data workshop in late February at the SoCal Linux
Expo:

http://www.socallinuxexpo.org/scale/13x

Right now we are scoping some getting-started, Spark-related content for the
event; a targeted intro to the Bigtop/Spark Puppet-powered deployment components
is going into the event as well.

Also the group will be holding a meetup at Amazon's Palo Alto office on Jan 
27th if any folks are interested.

Nate

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Tuesday, January 20, 2015 5:09 PM
To: Nicholas Chammas
Cc: dev
Subject: Re: Standardized Spark dev environment

My concern would mostly be maintenance. It adds to an already very complex 
build. It only assists developers who are a small audience. What does this 
provide, concretely?
On Jan 21, 2015 12:14 AM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 What do y'all think of creating a standardized Spark development 
 environment, perhaps encoded as a Vagrantfile, and publishing it under 
 `dev/`?

 The goal would be to make it easier for new developers to get started 
 with all the right configs and tools pre-installed.

 If we use something like Vagrant, we may even be able to make it so 
 that a single Vagrantfile creates equivalent development environments 
 across OS X, Linux, and Windows, without having to do much (or any) 
 OS-specific work.

 I imagine for committers and regular contributors, this exercise may 
 seem pointless, since y'all are probably already very comfortable with 
 your workflow.

 I wonder, though, if any of you think this would be worthwhile as a 
 improvement to the new Spark developer experience.

 Nick



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Standardized Spark dev environment

2015-01-20 Thread Will Benton
Hey Nick,

I did something similar with a Docker image last summer; I haven't updated the 
images to cache the dependencies for the current Spark master, but it would be 
trivial to do so:

http://chapeau.freevariable.com/2014/08/jvm-test-docker.html


best,
wb


- Original Message -
 From: Nicholas Chammas nicholas.cham...@gmail.com
 To: Spark dev list dev@spark.apache.org
 Sent: Tuesday, January 20, 2015 6:13:31 PM
 Subject: Standardized Spark dev environment
 
 What do y'all think of creating a standardized Spark development
 environment, perhaps encoded as a Vagrantfile, and publishing it under
 `dev/`?
 
 The goal would be to make it easier for new developers to get started with
 all the right configs and tools pre-installed.
 
 If we use something like Vagrant, we may even be able to make it so that a
 single Vagrantfile creates equivalent development environments across OS X,
 Linux, and Windows, without having to do much (or any) OS-specific work.
 
 I imagine for committers and regular contributors, this exercise may seem
 pointless, since y'all are probably already very comfortable with your
 workflow.
 
 I wonder, though, if any of you think this would be worthwhile as a
 improvement to the new Spark developer experience.
 
 Nick
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: not found: type LocalSparkContext

2015-01-20 Thread James
I could not correctly import org.apache.spark.LocalSparkContext.

I use sbt in IntelliJ for development; here is my build.sbt:

```
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.0"

libraryDependencies += "com.clearspring.analytics" % "stream" % "2.7.0"

libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```

I think maybe I have made some mistakes in the library settings. As a new
Spark application developer, I wonder what the standard procedure is for
developing a Spark application.
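
(Not from the original message, but for comparison: a conventional build.sbt for
a Spark 1.2.0 application might look roughly like the sketch below, with the test
framework scoped to Test; the project name and Scala version are assumptions.)

```
// Hedged sketch of a typical build.sbt for a Spark 1.2.0 application.
name := "my-spark-app"

scalaVersion := "2.10.4"   // assumed; match your Spark build

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.2.0",
  "org.apache.spark" %% "spark-graphx" % "1.2.0",
  "com.clearspring.analytics" % "stream" % "2.7.0",
  "org.scalatest" %% "scalatest" % "2.0" % "test"   // test scope is the usual convention
)
```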

Any reply is appreciated.


Alcaid


2015-01-21 2:05 GMT+08:00 Will Benton wi...@redhat.com:

 It's declared here:


 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala

 I assume you're already importing LocalSparkContext, but since the test
 classes aren't included in Spark packages, you'll also need to package them
 up in order to use them in your application (viz., outside of Spark).



 best,
 wb

 - Original Message -
  From: James alcaid1...@gmail.com
  To: dev@spark.apache.org
  Sent: Tuesday, January 20, 2015 6:35:07 AM
  Subject: not found: type LocalSparkContext
 
  Hi all,
 
  When I was trying to write a test on my spark application I met
 
  ```
  Error:(14, 43) not found: type LocalSparkContext
  class HyperANFSuite extends FunSuite with LocalSparkContext {
  ```
 
  At the source code of spark-core I could not found LocalSparkContext,
  thus I wonder how to write a test like [this] (
 
 https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
  )
 
  Alcaid
 



Re: GraphX ShortestPaths backwards?

2015-01-20 Thread Michael Malak
I created https://issues.apache.org/jira/browse/SPARK-5343 for this.


- Original Message -
From: Michael Malak michaelma...@yahoo.com
To: dev@spark.apache.org dev@spark.apache.org
Cc: 
Sent: Monday, January 19, 2015 5:09 PM
Subject: GraphX ShortestPaths backwards?

GraphX ShortestPaths seems to be following edges backwards instead of forwards:

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
  sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, ""))))

lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))

lib.ShortestPaths.run(g,Array(1)).vertices.collect

res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))

If I am not mistaken about my assessment, then I believe the following changes 
will make it run forward:

Change one occurrence of src to dst in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64

Change three occurrences of dst to src in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
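
(Not part of the original message: assuming the backward behaviour described
above is right, a quick workaround sketch is to run the algorithm on the
reversed graph, which should then yield the forward distances.)

```
// Hypothetical workaround, assuming ShortestPaths currently follows edges
// backwards: reverse the edges first to get forward distances to landmark 3
// (expected: 1 -> 2 hops, 2 -> 1 hop, 3 -> 0).
lib.ShortestPaths.run(g.reverse, Array(3)).vertices.collect
```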

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: not found: type LocalSparkContext

2015-01-20 Thread Reynold Xin
You don't need LocalSparkContext; it is only for Spark's own unit tests.

You can just create a SparkContext and use it in your unit tests, e.g.

val sc = new SparkContext("local", "my test app", new SparkConf)
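
For example, a minimal FunSuite along those lines might look like this (the
suite name and RDD contents below are just illustrative):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("count words locally") {
    val sc = new SparkContext("local", "my test app", new SparkConf)
    try {
      val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
      assert(counts("a") === 2L)
    } finally {
      sc.stop()   // always stop, or later tests may fail to create a context
    }
  }
}
```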

On Tue, Jan 20, 2015 at 7:27 PM, James alcaid1...@gmail.com wrote:

 I could not correctly import org.apache.spark.LocalSparkContext,

 I use sbt on Intellij for developing,here is my build sbt.

 ```
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

 libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.0"

 libraryDependencies += "com.clearspring.analytics" % "stream" % "2.7.0"

 libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0"

 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
 ```

 I think maybe I have make some mistakes on the library setting, as a new
 developer of spark application, I wonder what is the standard procedure of
 developing a spark application.

 Any reply is appreciated.


 Alcaid


 2015-01-21 2:05 GMT+08:00 Will Benton wi...@redhat.com:

  It's declared here:
 
 
 
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala
 
  I assume you're already importing LocalSparkContext, but since the test
  classes aren't included in Spark packages, you'll also need to package
 them
  up in order to use them in your application (viz., outside of Spark).
 
 
 
  best,
  wb
 
  - Original Message -
   From: James alcaid1...@gmail.com
   To: dev@spark.apache.org
   Sent: Tuesday, January 20, 2015 6:35:07 AM
   Subject: not found: type LocalSparkContext
  
   Hi all,
  
   When I was trying to write a test on my spark application I met
  
   ```
   Error:(14, 43) not found: type LocalSparkContext
   class HyperANFSuite extends FunSuite with LocalSparkContext {
   ```
  
   At the source code of spark-core I could not found LocalSparkContext,
   thus I wonder how to write a test like [this] (
  
 
 https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
   )
  
   Alcaid
  
 



Re: Standardized Spark dev environment

2015-01-20 Thread Patrick Wendell
To respond to the original suggestion by Nick. I always thought it
would be useful to have a Docker image on which we run the tests and
build releases, so that we could have a consistent environment that
other packagers or people trying to exhaustively run Spark tests could
replicate (or at least look at) to understand exactly how we recommend
building Spark. Sean, do you think that is too high an overhead?

In terms of providing images that we encourage as standard deployment
images of Spark and want to make portable across environments, that's
a much larger project and one with higher associated maintenance
overhead. So I'd be interested in seeing that evolve as its own
project (spark-deploy) or something associated with bigtop, etc.

- Patrick

On Tue, Jan 20, 2015 at 10:30 PM, Paolo Platter
paolo.plat...@agilelab.it wrote:
 Hi all,
 I also tried the docker way and it works well.
 I suggest looking at the sequenceiq/spark Docker images; they are very active
 in that field.

 Paolo

 Sent from my Windows Phone

 From: jay vyas mailto:jayunit100.apa...@gmail.com
 Sent: 21/01/2015 04:45
 To: Nicholas Chammas mailto:nicholas.cham...@gmail.com
 Cc: Will Benton mailto:wi...@redhat.com; Spark dev list mailto:dev@spark.apache.org
 Subject: Re: Standardized Spark dev environment

 I can comment on both...  hi will and nate :)

 1) Will's Dockerfile solution is  the most  simple direct solution to the
 dev environment question : its a  efficient way to build and develop spark
 environments for dev/test..  It would be cool to put that Dockerfile
 (and/or maybe a shell script which uses it) in the top level of spark as
 the build entry point.  For total platform portability, u could wrap in a
 vagrantfile to launch a lightweight vm, so that windows worked equally
 well.

 2) However, since nate mentioned  vagrant and bigtop, i have to chime in :)
 the vagrant recipes in bigtop are a nice reference deployment of how to
 deploy spark in a heterogenous hadoop style environment, and tighter
 integration testing w/ bigtop for spark releases would be lovely !  The
 vagrant stuff use puppet to deploy an n node VM or docker based cluster, in
 which users can easily select components (including
 spark,yarn,hbase,hadoop,etc...) by simnply editing a YAML file :
 https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
 As nate said, it would be alot of fun to get more cross collaboration
 between the spark and bigtop communities.   Input on how we can better
 integrate spark (wether its spork, hbase integration, smoke tests aroudn
 the mllib stuff, or whatever, is always welcome )






 On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 How many profiles (hadoop / hive /scala) would this development environment
 support ?

 As many as we want. We probably want to cover a good chunk of the build
 matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark
 officially supports.

 What does this provide, concretely?

 It provides a reliable way to create a good Spark development
 environment. Roughly speaking, this probably should mean an environment
 that matches Jenkins, since that's where we run official testing and
 builds.

 For example, Spark has to run on Java 6 and Python 2.6. When devs build and
 run Spark locally, we can make sure they're doing it on these versions of
 the languages with a simple vagrant up.

 Nate, could you comment on how something like this would relate to the
 Bigtop effort?

 http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

 Will, that's pretty sweet. I tried something similar a few months ago as an
 experiment to try building/testing Spark within a container. Here's the
 shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa
 
 against the base CentOS Docker image to setup an environment ready to build
 and test Spark.

 We want to run Spark unit tests within containers on Jenkins, so it might
 make sense to develop a single Docker image that can be used as both a dev
 environment as well as execution container on Jenkins.

 Perhaps that's the approach to take instead of looking into Vagrant.

 Nick

 On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote:

 Hey Nick,
 
  I did something similar with a Docker image last summer; I haven't
 updated
  the images to cache the dependencies for the current Spark master, but it
  would be trivial to do so:
 
  http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
 
 
  best,
  wb
 
 
  - Original Message -
   From: Nicholas Chammas nicholas.cham...@gmail.com
   To: Spark dev list dev@spark.apache.org
   Sent: Tuesday, January 20, 2015 6:13:31 PM
   Subject: Standardized Spark dev environment
  
   What do y'all think of creating a standardized Spark development
   environment, perhaps encoded as a Vagrantfile, and publishing it under
   `dev/`?
  
   The goal would be to make it 

KNN for large data set

2015-01-20 Thread DEVAN M.S.
Hi all,

Please help me find the best way to do K-nearest neighbors using Spark on
large data sets.


R: Standardized Spark dev environment

2015-01-20 Thread Paolo Platter
Hi all,
I also tried the Docker way and it works well.
I suggest looking at the sequenceiq/spark Docker images; they are very active in
that field.

Paolo

Sent from my Windows Phone

From: jay vyas mailto:jayunit100.apa...@gmail.com
Sent: 21/01/2015 04:45
To: Nicholas Chammas mailto:nicholas.cham...@gmail.com
Cc: Will Benton mailto:wi...@redhat.com; Spark dev list mailto:dev@spark.apache.org
Subject: Re: Standardized Spark dev environment

I can comment on both...  hi will and nate :)

1) Will's Dockerfile solution is  the most  simple direct solution to the
dev environment question : its a  efficient way to build and develop spark
environments for dev/test..  It would be cool to put that Dockerfile
(and/or maybe a shell script which uses it) in the top level of spark as
the build entry point.  For total platform portability, u could wrap in a
vagrantfile to launch a lightweight vm, so that windows worked equally
well.

2) However, since nate mentioned  vagrant and bigtop, i have to chime in :)
the vagrant recipes in bigtop are a nice reference deployment of how to
deploy spark in a heterogenous hadoop style environment, and tighter
integration testing w/ bigtop for spark releases would be lovely !  The
vagrant stuff use puppet to deploy an n node VM or docker based cluster, in
which users can easily select components (including
spark,yarn,hbase,hadoop,etc...) by simnply editing a YAML file :
https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
As nate said, it would be alot of fun to get more cross collaboration
between the spark and bigtop communities.   Input on how we can better
integrate spark (wether its spork, hbase integration, smoke tests aroudn
the mllib stuff, or whatever, is always welcome )






On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 How many profiles (hadoop / hive /scala) would this development environment
 support ?

 As many as we want. We probably want to cover a good chunk of the build
 matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark
 officially supports.

 What does this provide, concretely?

 It provides a reliable way to create a “good” Spark development
 environment. Roughly speaking, this probably should mean an environment
 that matches Jenkins, since that’s where we run “official” testing and
 builds.

 For example, Spark has to run on Java 6 and Python 2.6. When devs build and
 run Spark locally, we can make sure they’re doing it on these versions of
 the languages with a simple vagrant up.

 Nate, could you comment on how something like this would relate to the
 Bigtop effort?

 http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

 Will, that’s pretty sweet. I tried something similar a few months ago as an
 experiment to try building/testing Spark within a container. Here’s the
 shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa
 
 against the base CentOS Docker image to setup an environment ready to build
 and test Spark.

 We want to run Spark unit tests within containers on Jenkins, so it might
 make sense to develop a single Docker image that can be used as both a “dev
 environment” as well as execution container on Jenkins.

 Perhaps that’s the approach to take instead of looking into Vagrant.

 Nick

 On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote:

 Hey Nick,
 
  I did something similar with a Docker image last summer; I haven't
 updated
  the images to cache the dependencies for the current Spark master, but it
  would be trivial to do so:
 
  http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
 
 
  best,
  wb
 
 
  - Original Message -
   From: Nicholas Chammas nicholas.cham...@gmail.com
   To: Spark dev list dev@spark.apache.org
   Sent: Tuesday, January 20, 2015 6:13:31 PM
   Subject: Standardized Spark dev environment
  
   What do y'all think of creating a standardized Spark development
   environment, perhaps encoded as a Vagrantfile, and publishing it under
   `dev/`?
  
   The goal would be to make it easier for new developers to get started
  with
   all the right configs and tools pre-installed.
  
   If we use something like Vagrant, we may even be able to make it so
 that
  a
   single Vagrantfile creates equivalent development environments across
 OS
  X,
   Linux, and Windows, without having to do much (or any) OS-specific
 work.
  
   I imagine for committers and regular contributors, this exercise may
 seem
   pointless, since y'all are probably already very comfortable with your
   workflow.
  
   I wonder, though, if any of you think this would be worthwhile as a
   improvement to the new Spark developer experience.
  
   Nick
  
 
 ​




--
jay vyas


Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian

Hey Yi,

I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would 
like to investigate this issue later. Would you please open a JIRA for 
it? Thanks!


Cheng

On 1/19/15 1:00 AM, Yi Tian wrote:


Is there any way to support multiple users executing SQL on one thrift 
server?


I think there are some problems with Spark 1.2.0, for example:

 1. Start thrift server with user A
 2. Connect to thrift server via beeline with user B
 3. Execute “insert into table dest select … from table src”

Then we found these items on HDFS:

```
drwxr-xr-x   - B supergroup     0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
drwxr-xr-x   - B supergroup     0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
drwxr-xr-x   - B supergroup     0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
drwxr-xr-x   - A supergroup     0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
drwxr-xr-x   - A supergroup     0 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
-rw-r--r--   3 A supergroup  2671 2015-01-16 16:42 /tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0
```

You can see that all the temporary paths created on the driver side (thrift 
server side) are owned by user B (which is what we expected).

But all the output data created on the executor side is owned by user A 
(which is NOT what we expected). The wrong owner on the output data causes an 
org.apache.hadoop.security.AccessControlException while the driver side moves 
the output data into the dest table.


Does anyone know how to resolve this problem?

​