Re: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread Prashant Sharma
You will need to change the sbt version to 0.13.2. I think Spark 0.9.1 was released with sbt 0.13? In case not, it may not work with Java 8. Just wait for the 1.0 release or give the 1.0 release candidate a try!

RE: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread N . Venkata Naga Ravi
Thanks for your quick reply. I tried with a fresh installation; it downloads sbt 0.12.4 only (please check the logs below), so it is not working. Can you tell me where this 1.0 release candidate is located so I can try it? dhcp-173-39-68-28:spark-0.9.1 neravi$ ./sbt/sbt assembly Attempting to fetch sbt

Re: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread Prashant Sharma
I have pasted the link in my previous post. Prashant Sharma On Fri, May 2, 2014 at 4:15 PM, N.Venkata Naga Ravi nvn_r...@hotmail.com wrote: Thanks for your quick reply. I tried with a fresh installation, it downloads sbt 0.12.4 only (please check below logs). So it is not working. Can you

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Koert Kuipers
Not sure why applying concat to reference.conf didn't work for you; since it simply concatenates the files, the key akka.version should be preserved. We had the same situation for a while without issues. On May 1, 2014 8:46 PM, Shivani Rao raoshiv...@gmail.com wrote: Hello Koert, That did not
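For reference, the concat approach looks roughly like the build.sbt fragment below. This is a sketch using the sbt-assembly syntax current at the time (the `<<=` operator of sbt 0.13-era plugins); adapt it to your plugin version. It concatenates all reference.conf files from the dependency jars so that keys such as akka.version survive in the fat jar:

```scala
// build.sbt fragment (sketch): sbt-assembly merge strategy that
// concatenates reference.conf files instead of picking one of them.
import sbtassembly.Plugin._
import AssemblyKeys._

assemblySettings

mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case "reference.conf" => MergeStrategy.concat  // keep akka.version etc.
    case x                => old(x)                // default for everything else
  }
}
```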

Re: help me

2014-05-02 Thread Mayur Rustagi
Spark would be much faster on process_local instead of node_local. Node_local reads data from the local hard disk; process_local reads data already in memory in the same JVM process. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue,

when to use broadcast variables

2014-05-02 Thread Diana Carroll
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD? Like, say I'm joining web logs in an RDD with user account data. I could keep the account data in an RDD or if it's small, a broadcast variable instead. How small is small? Small enough that I know it

Re: when to use broadcast variables

2014-05-02 Thread Prashant Sharma
I'd like to be corrected on this, but I am just trying to say small enough is on the order of a few hundred MBs. Bear in mind the data gets shipped to all nodes; it can be a GB but not GBs, and then it depends on the network too. Prashant Sharma On Fri, May 2, 2014 at 6:42 PM, Diana Carroll
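Setting the sizing question aside, the plain-Python sketch below (no Spark required; the names are illustrative) shows the idea behind broadcasting the small side of a join: ship the small table to every worker once and do a local hash lookup, instead of shuffling the large RDD for a full join. In Spark the dict would be wrapped with `sc.broadcast(accounts)` and tasks would read `.value`:

```python
# Sketch of the broadcast-join idea in plain Python (no Spark):
# the small side is shipped once and looked up locally per record.

accounts = {1: "alice", 2: "bob"}                      # small side: broadcast this
web_logs = [(1, "/home"), (2, "/cart"), (1, "/buy")]   # large side: the RDD

# In Spark each task would read broadcast_var.value; here we close over
# the dict directly to show the map-side lookup.
joined = [(user, accounts.get(user), page) for user, page in web_logs]

print(joined)  # every log line annotated with its account name
```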

RE: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread N . Venkata Naga Ravi
Thanks Prashant. The 1.0 RC version is working fine on my system. Let me explore further and get back to you. Thanks again, Ravi From: scrapco...@gmail.com Date: Fri, 2 May 2014 16:22:40 +0530 Subject: Re: Apache Spark is not building in Mac/Java 8 To: user@spark.apache.org I have pasted the link

Re: Opinions stratosphere

2014-05-02 Thread Philip Ogren
Great reference! I just skimmed through the results without reading much of the methodology - but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2GB to 8GB. Who knows if the apparent pattern would extend out

Re: Opinions stratosphere

2014-05-02 Thread Michael Malak
looks like Spark outperforms Stratosphere fairly consistently in the experiments There was one exception the paper noted, which was when memory resources were constrained. In that case, Stratosphere seemed to have degraded more gracefully than Spark, but the author did not explore it deeper.

Re: Task not serializable: collect, take

2014-05-02 Thread SK
Thank you very much. Making the trait serializable worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-not-serializable-collect-take-tp5193p5236.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi All, I encountered this problem when the firewall is enabled between the spark-shell and the Workers. When I launch spark-shell in yarn-client mode, I notice that Workers on the YARN containers are trying to talk to the driver (spark-shell), however, the firewall is not opened and caused

Re: Equally weighted partitions in Spark

2014-05-02 Thread Andrew Ash
Deenar, I haven't heard of any activity to do partitioning in that way, but it does seem more broadly valuable. On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar deenar.toras...@db.comwrote: I have equal sized partitions now, but I want the RDD to be partitioned such that the partitions are
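One way to approximate equally weighted partitions, sketched in plain Python (the function name and greedy scheme are my own invention, not a Spark API), is to greedily assign each weighted item to the currently lightest partition. In Spark this logic would live inside a custom Partitioner:

```python
# Hypothetical sketch: greedy "longest processing time" assignment of
# weighted keys to the lightest partition, approximating equal weights.
import heapq

def weighted_partitions(items, num_partitions):
    """items: list of (key, weight); returns a list of key lists."""
    heap = [(0, i) for i in range(num_partitions)]   # (total_weight, idx)
    heapq.heapify(heap)
    parts = [[] for _ in range(num_partitions)]
    for key, weight in sorted(items, key=lambda kw: -kw[1]):
        total, idx = heapq.heappop(heap)             # lightest partition
        parts[idx].append(key)
        heapq.heappush(heap, (total + weight, idx))
    return parts

print(weighted_partitions([("a", 5), ("b", 3), ("c", 3), ("d", 1)], 2))
```

With the weights above both partitions end up with total weight 6, which is the balance the thread is after.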

GraphX vertices and connected edges

2014-05-02 Thread Kyle Ellrott
What is the most efficient way to get an RDD of GraphX vertices and their connected edges? Initially I thought I could use mapReduceTriplets, but I realized that would neglect vertices that aren't connected to anything. Would I have to do a mapReduceTriplets and then do a join with all of the vertices to

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Shivani Rao
Hello Stephen, My goal was to run spark on a cluster that already had spark and hadoop installed. So the right thing to do was to remove these dependencies in my spark build. I wrote a blog http://myresearchdiaries.blogspot.com/2014/05/building-apache-spark-jars.html about it so that it might

Re: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Yana Kadiyska
I think what you want to do is set spark.driver.port to a fixed port. On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote: Hi All, I encountered this problem when the firewall is enabled between the spark-shell and the Workers. When I launch spark-shell in yarn-client
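As an illustration only (the port number is arbitrary, and `SPARK_JAVA_OPTS` was the 0.9.x-era way to pass system properties; check your version's configuration docs), pinning the driver port in spark-env.sh might look like:

```bash
# spark-env.sh fragment (sketch): fix the driver port so a firewall
# rule can be opened for it. Port 51000 is an arbitrary example.
export SPARK_JAVA_OPTS="-Dspark.driver.port=51000 $SPARK_JAVA_OPTS"
```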

Re: is it possible to initiate Spark jobs from Oozie?

2014-05-02 Thread Shivani Rao
I have mucked around this a little bit. The first step to make this happen is to build a fat jar. I wrote a quick blog http://myresearchdiaries.blogspot.com/2014/05/building-apache-spark-jars.html documenting my learning curve w.r.t. that. The next step is to schedule this as a java action. Since

Re: another updateStateByKey question

2014-05-02 Thread Tathagata Das
Could be a bug. Can you share code with data that I can use to reproduce this? TD On May 2, 2014 9:49 AM, Adrian Mocanu amoc...@verticalscope.com wrote: Has anyone else noticed that *sometimes* the same tuple calls update state function twice? I have 2 tuples with the same key in 1 RDD

Re: java.lang.ClassNotFoundException - spark on mesos

2014-05-02 Thread bo...@shopify.com
I have opened a PR for discussion on the apache/spark repository https://github.com/apache/spark/pull/620 There is certainly a classLoader problem in the way Mesos and Spark operate, I'm not sure what caused it to suddenly stop working so I'd like to open the discussion there -- View this

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi Yana, I did. I configured the port in spark-env.sh; the problem is not the driver port, which is fixed. It's the Workers' ports that are dynamic every time they are launched in the YARN container. :-( Any idea how to restrict the Workers' port range? Date: Fri, 2 May 2014 14:49:23

Re: Incredible slow iterative computation

2014-05-02 Thread Andrew Ash
If you end up with a really long dependency tree between RDDs (like 100+) people have reported success with using the .checkpoint() method. This computes the RDD and then saves it, flattening the dependency tree. It turns out that having a really long RDD dependency graph causes serialization

Invoke spark-shell without attempting to start the http server

2014-05-02 Thread Stephen Boesch
We have a spark server already running. When invoking spark-shell, it attempts to start a new HTTP server (spark.HttpServer: Starting HTTP Server). But that attempt results in a BindException due to the preexisting server: java.net.BindException: Address already in use What is the

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Jacob Eisinger
Howdy Andrew, I think I am running into the same issue [1] as you. It appears that Spark opens up dynamic / ephemeral [2] ports for each job on the shell and the workers. As you are finding out, this makes securing and managing the network for Spark very difficult. Any idea how to restrict

Re: GraphX vertices and connected edges

2014-05-02 Thread Ankur Dave
Do you mean you want to obtain a list of adjacent edges for every vertex? A mapReduceTriplets followed by a join is the right way to do this. The join will be cheap because the original and derived vertices will share indices. There's a built-in function to do this for neighboring vertex
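The mapReduceTriplets-plus-join pattern can be sketched in plain Python (no GraphX; the data and names are illustrative): for each edge, emit it to both endpoints (the "map"), collect the per-vertex lists (the "reduce"), then join with the vertex attributes so isolated vertices are not lost:

```python
# Plain-Python sketch of "aggregate edges per vertex, then join with
# the vertex table" -- the join is what keeps unconnected vertices.
vertices = {1: "a", 2: "b", 3: "c"}   # vertex 3 has no edges
edges = [(1, 2, "e12")]

adjacent = {v: [] for v in vertices}  # the "reduce" side
for src, dst, attr in edges:
    adjacent[src].append((src, dst, attr))
    adjacent[dst].append((src, dst, attr))

# The join: every vertex keeps its attribute even with no edges.
result = {v: (vertices[v], adjacent[v]) for v in vertices}
print(result[3])  # vertex 3 survives with an empty edge list
```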

docker image build issue for spark 0.9.1

2014-05-02 Thread Weide Zhang
Hi, I tried to build a docker image for spark 0.9.1 but get the following error. Has anyone experience resolving the issue? The following packages have unmet dependencies: tzdata-java : Depends: tzdata (= 2012b-1) but 2013g-0ubuntu0.12.04 is to be installed E: Unable to correct problems, you

Re: docker image build issue for spark 0.9.1

2014-05-02 Thread Weide Zhang
yes, the docker script is there inside spark source package. It already specifies the master and worker container to run in different docker containers. Mainly it is used for easy deployment and development in my scenario. On Fri, May 2, 2014 at 2:30 PM, Nicholas Chammas

Seattle Spark Meetup Slides

2014-05-02 Thread Denny Lee
We’ve had some pretty awesome presentations at the Seattle Spark Meetup - here are the links to the various slides: Seattle Spark Meetup KickOff with DataBricks | Introduction to Spark with Matei Zaharia and Pat McDonough Learnings from Running Spark at Twitter sessions Ben Hindman’s Mesos

spark 0.9.1: ClassNotFoundException

2014-05-02 Thread SK
I am using Spark 0.9.1 in standalone mode. In the SPARK_HOME/examples/src/main/scala/org/apache/spark/ folder, I created a directory called mycode in which I have placed some standalone scala code. I was able to compile. I ran the code using: ./bin/run-example org.apache.spark.mycode.MyClass

Re: YARN issues with resourcemanager.scheduler.address

2014-05-02 Thread zsterone
ok, we figured it out. It is a bit weird, but for some reason, the YARN_CONF_DIR and HADOOP_CONF_DIR did not propagate out. We do see it in the build classpath, but the remote machines don't seem to get it. So we added: export SPARK_YARN_USER_ENV=CLASSPATH=/hadoop/var/hadoop/conf/ and it seems

Re: string to int conversion

2014-05-02 Thread DB Tsai
You can drop the header in a csv with rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => if (partitionIdx == 0) lines.drop(1) else lines). On May 2, 2014 6:02 PM, SK skrishna...@gmail.com wrote: 1) I have a csv file where one of the fields has integer data but it
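The same per-partition header drop, sketched in plain Python (no Spark needed; the toy "RDD" is just a list of partitions): only partition 0 contains the csv header, so skip the first line there and pass every other partition through unchanged:

```python
# Plain-Python sketch of the mapPartitionsWithIndex header-drop trick.
def drop_header(partition_idx, lines):
    it = iter(lines)
    if partition_idx == 0:
        next(it, None)        # skip the header line in partition 0 only
    return it

partitions = [["id,count", "1,10"], ["2,20", "3,30"]]  # toy "RDD"
cleaned = [list(drop_header(i, p)) for i, p in enumerate(partitions)]
print(cleaned)  # header gone, data rows intact
```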

Reading and processing binary format files using spark

2014-05-02 Thread Chengi Liu
Hi, Let's say I have millions of binary format files, and let's say I have this java (or python) library which reads and parses these binary formatted files. Say import foo f = foo.open(filename) header = f.get_header() and some other methods. What I was thinking was to write hadoop input