Re: Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-26 Thread Patrick Wendell
I'm not sure exactly how your cluster is configured. But as far as I can tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find the exact CDH version you have and link against the `mr1` version of their published dependencies in that version. So I think you want

ALS memory limits

2014-03-26 Thread Debasish Das
Hi, For our use cases we are looking into 20 x 1M matrices, which come in similar ranges to those outlined by the paper here: http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html Does the exponential runtime growth in Spark ALS outlined by the blog still exist in

RDD Collect returns empty arrays

2014-03-26 Thread gaganbm
I am getting strange behavior with the RDDs. All I want is to persist the RDD contents in a single file. saveAsTextFile() saves them in multiple text files, one per partition. So I tried rdd.coalesce(1, true).saveAsTextFile(). This fails with the exception:
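
A minimal sketch of the coalesce-then-save approach being attempted, assuming a local master and a throwaway output path (both placeholders):

    import org.apache.spark.SparkContext

    object SingleFileSave {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "SingleFileSave")
        val rdd = sc.parallelize(1 to 100).map(_.toString)

        // coalesce(1, shuffle = true) pulls everything into one partition,
        // so saveAsTextFile writes a single part-00000 file instead of one
        // file per partition.
        rdd.coalesce(1, shuffle = true).saveAsTextFile("/tmp/single-output")

        sc.stop()
      }
    }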

Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-26 Thread qingyang li
Egor, I am encountering the same problem you asked about in this thread: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E Have you fixed this problem? I am using Shark to read a table which I have created on

Re: Shark does not give any results with SELECT count(*) command

2014-03-26 Thread qingyang li
Hi Praveen, thanks for replying. I am using hive-0.11, which comes from AMPLab. At the beginning the hive-site.xml from AMPLab was empty, so I copied one hive-site.xml from my cluster, then removed some attributes and also added some attributes. I think that is not the reason for my problem, I think

Re: Shark does not give any results with SELECT count(*) command

2014-03-26 Thread Praveen R
Oh k. You must be running shark server on bigdata001 to use it from other machines. ./bin/shark --service sharkserver # runs shark server on port 1 You could connect to shark server as ./bin/shark -h bigdata001, this should work unless there is a firewall blocking it. You might use telnet

Re: Shark does not give any results with SELECT count(*) command

2014-03-26 Thread qingyang li
Hi Praveen, I can start the server on bigdata001 using ./bin/shark --service sharkserver, and I can also connect to this server using ./bin/shark -h bigdata001. But the problem is still there: running select count(*) from b on bigdata001 gives no result and no error; running select count(*) from b on bigdata002, no

Re: How to set environment variable for a spark job

2014-03-26 Thread santhoma
OK, it was working. I printed System.getenv(..) for both env variables and they gave the correct values. However, it did not give me the intended result. My intention was to load a native library from LD_LIBRARY_PATH, but it looks like the library is loaded from the value of -Djava.library.path. Value
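
A minimal sketch of checking which path the JVM actually consults, assuming a hypothetical native library named libmylib.so:

    object NativeLibCheck {
      def main(args: Array[String]): Unit = {
        // System.loadLibrary resolves against java.library.path, not LD_LIBRARY_PATH,
        // so this is the value that matters on the executors.
        println("java.library.path = " + System.getProperty("java.library.path"))
        System.loadLibrary("mylib") // expects libmylib.so on java.library.path
      }
    }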

Re: Shark does not give any results with SELECT count(*) command

2014-03-26 Thread qingyang li
i have found such log on bigdata003: 14/03/25 17:08:49 INFO network.ConnectionManager: Accepted connection from [bigdata001/192.168.1.101] 14/03/25 17:08:49 INFO network.ConnectionManager: Accepted connection from [bigdata002/192.168.1.102] 14/03/25 17:08:49 INFO network.ConnectionManager:

Re: tracking resource usage for spark-shell commands

2014-03-26 Thread Bharath Bhushan
Thanks for the response Mayur. I was seeing the webui of 0.9.0 spark. I see lots of detailed statistics in the newer 1.0.0-snapshot version. The only thing I found missing was the actual code that I had typed in at the spark-shell prompt, but I can always get it from the shell history. On

Re: Spark Streaming - Shared hashmaps

2014-03-26 Thread Tathagata Das
When you say "launch long-running tasks", do you mean long-running Spark jobs/tasks, or long-running tasks in another system? If the rate of requests from Kafka is not low (in terms of records per second), you could collect the records in the driver, and maintain the shared bag in the driver. A

Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
Is it possible to run across a cluster using the Spark interactive shell? To be more explicit, is the procedure similar to running standalone master-slave Spark? I want to execute my code in the interactive shell on the master node, and it should run across the cluster [say 5 nodes]. Is the procedure

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
What do you mean by run across the cluster? Do you want to start spark-shell across the cluster, or do you want to distribute tasks to multiple machines? If the former, yes, as long as you indicate the right master URL. If the latter, also yes; you can observe the distributed task in the

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
Nan Zhu, it's the latter: I want to distribute the tasks to the cluster [the machines available]. If I set SPARK_MASTER_IP on the other machines and set the slave IPs in /conf/slaves at the master node, will the interactive shell code run at the master get distributed across multiple machines

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
All you need to do is ensure your Spark cluster is running well (you can check by accessing the Spark UI to see whether all workers are displayed). Then you have to set the correct SPARK_MASTER_IP on the machine where you run spark-shell. In more detail: when you run bin/spark-shell, it will

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Yana Kadiyska
Nan (or anyone who feels they understand the cluster architecture well), can you clarify something for me? From reading this user group and your explanation above, it appears that the cluster master is only involved during application startup -- to allocate executors (from what you wrote

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
The master does more work than that, actually; I just explained why he should set MASTER_IP correctly. A simplified list: 1. maintain worker status; 2. maintain in-cluster driver status; 3. maintain executor status (the worker tells the master what happened on the executor, -- Nan Zhu On

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
And, yes, I think that picture is a bit misleading, though the following paragraph mentions that “ Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Ognen Duzlevski
Have you looked at the individual nodes logs? Can you post a bit more of the exception's output? On 3/26/14, 8:42 AM, Jaonary Rabarisoa wrote: Hi all, I got java.lang.ClassNotFoundException even with addJar called. The jar file is present in each node. I use the version of spark from

Re: spark-shell on standalone cluster gives error no mesos in java.library.path

2014-03-26 Thread Christoph Böhm
Hi, I have a similar issue to the user below: I’m running Spark 0.8.1 (standalone). When I test the streaming NetworkWordCount example as in the docs with local[2], it works fine. As soon as I want to connect to my cluster using [NetworkWordCount master …] it says: --- Failed to load native

closures moving averages (state)

2014-03-26 Thread Adrian Mocanu
I'm passing a moving average function during the map phase like this: val average = new Sma(window = 3) stream.map(x => average.addNumber(x)) where class Sma extends Serializable { .. } I also tried to put creation of the average object in an object like I saw in another post: object Average {
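
A minimal sketch of the Sma class described above (window size and method name taken from the post; the implementation details are assumptions). Note that each task deserializes its own copy, so the accumulated state is per-task, not global:

    class Sma(window: Int) extends Serializable {
      private val values = scala.collection.mutable.Queue[Double]()

      // Returns the average of the last `window` values seen by this copy.
      def addNumber(x: Double): Double = {
        values.enqueue(x)
        if (values.size > window) values.dequeue()
        values.sum / values.size
      }
    }

    // Usage as in the post, assuming stream is a DStream[Double]:
    // val average = new Sma(window = 3)
    // stream.map(x => average.addNumber(x))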

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Ognen Duzlevski
Have you looked through the logs fully? I have seen this (in my limited experience) pop up as a result of previous exceptions/errors, also as a result of being unable to serialize objects etc. Ognen On 3/26/14, 10:39 AM, Jaonary Rabarisoa wrote: I notice that I get this error when I'm trying

RE: closures moving averages (state)

2014-03-26 Thread Adrian Mocanu
Tried with reduce and it's giving me pretty weird results that make no sense, i.e. one number for an entire RDD: val smaStream = inputStream.reduce { case (t1, t2) => { val sma = average.addDataPoint(t1); sma } } Tried with transform and that worked correctly, but unfortunately, it

interleave partitions?

2014-03-26 Thread Walrus theCat
Hi, I want to do something like this: rdd3 = rdd1.coalesce(N).partitions.zip(rdd2.coalesce(N).partitions) I realize the above will get me something like Array[(partition,partition)]. I hope you see what I'm going for here -- any tips on how to accomplish this? Thanks

streaming questions

2014-03-26 Thread Diana Carroll
I'm trying to understand Spark streaming, hoping someone can help. I've kinda-sorta got a version of Word Count running, and it looks like this: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ object StreamingWordCount { def
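
A minimal, self-contained sketch of the kind of word-count program being described, assuming a socket text source on localhost:9999 and a 5-second batch interval (both placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for the receiver, at least one for processing
        val ssc = new StreamingContext("local[2]", "StreamingWordCount", Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999)
        val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }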

Re: streaming questions

2014-03-26 Thread Tathagata Das
*Answer 1:* Make sure you are using master as local[n] with n > 1 (assuming you are running it in local mode). The way Spark Streaming works is that it assigns a core to the data receiver, and so if you run the program with only one core (i.e., with local or local[1]), then it won't have resources to

RE: streaming questions

2014-03-26 Thread Adrian Mocanu
Hi Diana, I'll answer Q3: You can check if an RDD is empty in several ways. Someone here mentioned that using an iterator was safer: val isEmpty = rdd.mapPartitions(iter => Iterator(!iter.hasNext)).reduce(_ && _) You can also check with a fold or rdd.count rdd.reduce(_ + _) // can't handle

Re: interleave partitions?

2014-03-26 Thread Walrus theCat
Answering my own question here. This may not be efficient, but this is what I came up with: rdd1.coalesce(N).glom.zip(rdd2.coalesce(N).glom).map { case (x, y) => x ++ y } On Wed, Mar 26, 2014 at 11:11 AM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I want to do something like this: rdd3 =
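
A self-contained sketch of the same glom/zip idea, assuming both RDDs are coalesced to the same number of partitions (a requirement for zip) and using placeholder data:

    import org.apache.spark.SparkContext

    object InterleavePartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "InterleavePartitions")
        val rdd1 = sc.parallelize(1 to 100, 8)
        val rdd2 = sc.parallelize(101 to 200, 8)

        val n = 4
        // glom turns each partition into an Array, zip pairs partitions positionally,
        // and the map concatenates each pair of arrays.
        val merged = rdd1.coalesce(n).glom()
          .zip(rdd2.coalesce(n).glom())
          .map { case (x, y) => x ++ y }

        // Flatten back to individual elements if per-partition arrays are not needed.
        val elements = merged.flatMap(x => x)
        println(elements.count())

        sc.stop()
      }
    }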

Re: streaming questions

2014-03-26 Thread Diana Carroll
Thanks, Tathagata, very helpful. A follow-up question below... On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das tathagata.das1...@gmail.com wrote: *Answer 3:* You can do something like wordCounts.foreachRDD((rdd: RDD[...], time: Time) => { if (rdd.take(1).size == 1) { // There exists
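
A minimal sketch of that guard, assuming wordCounts is a DStream[(String, Int)] as in the word-count example earlier in this thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
      // take(1) stops after the first element it finds, so this is a cheap way
      // to skip batches that produced no data.
      if (rdd.take(1).size == 1) {
        println("Batch at " + time + " has data; first element: " + rdd.first())
      }
    })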

YARN problem using an external jar in worker nodes

2014-03-26 Thread Sung Hwan Chung
Hello, (this is Yarn related) I'm able to load an external jar and use its classes within ApplicationMaster. I wish to use this jar within worker nodes, so I added sc.addJar(pathToJar) and ran. I get the following exception: org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4

Re: rdd.saveAsTextFile problem

2014-03-26 Thread Tathagata Das
Can you give us the more detailed exception + stack trace in the log? It should be in the driver log. If not, please take a look at the executor logs, through the web ui to find the stack trace. TD On Tue, Mar 25, 2014 at 10:43 PM, gaganbm gagan.mis...@gmail.com wrote: Hi Folks, Is this

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Jaonary Rabarisoa
It seems to be an old problem: http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I Has anyone found a solution? On Wed, Mar 26, 2014 at

Re: YARN problem using an external jar in worker nodes

2014-03-26 Thread Sandy Ryza
Hi Sung, Are you using yarn-standalone mode? Have you specified the --addJars option with your external jars? -Sandy On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Hello, (this is Yarn related) I'm able to load an external jar and use its classes within

Re: java.lang.ClassNotFoundException

2014-03-26 Thread Aniket Mokashi
context.objectFile[ReIdDataSetEntry](data) - not sure how this is compiled in Scala. But if it uses some sort of ObjectInputStream, you need to be careful: ObjectInputStream uses the root classloader to load classes and does not work with jars that are added to the TCCL (thread context classloader). Apache Commons has
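
A minimal sketch of the kind of workaround being alluded to (the class name here is hypothetical): an ObjectInputStream that resolves classes through the thread context classloader first, falling back to the default behavior:

    import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

    class ContextAwareObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
      override def resolveClass(desc: ObjectStreamClass): Class[_] = {
        val loader = Thread.currentThread().getContextClassLoader
        try {
          // Look the class up via the context classloader, which sees jars added at runtime.
          Class.forName(desc.getName, false, loader)
        } catch {
          case _: ClassNotFoundException => super.resolveClass(desc)
        }
      }
    }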

Announcing Spark SQL

2014-03-26 Thread Michael Armbrust
Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to a new feature we are pretty excited about for Spark 1.0. http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Michael

All pairs shortest paths?

2014-03-26 Thread Ryan Compton
No idea how feasible this is. Has anyone done it?

Re: Announcing Spark SQL

2014-03-26 Thread Nicholas Chammas
This is so, so COOL. YES. I'm excited about using this once I'm a bit more comfortable with Spark. Nice work, people! On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote: Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as

RE: Announcing Spark SQL

2014-03-26 Thread Bingham, Skyler
Fantastic! Although, I think they missed an obvious name choice: SparkQL (pronounced sparkle) :) Skyler From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, March 26, 2014 3:58 PM To: user@spark.apache.org Subject: Announcing Spark SQL Hey Everyone, This already went out

Re: All pairs shortest paths?

2014-03-26 Thread Ryan Compton
To clarify: I don't need the actual paths, just the distances. On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton compton.r...@gmail.com wrote: No idea how feasible this is. Has anyone done it?

Re: coalescing RDD into equally sized partitions

2014-03-26 Thread Walrus theCat
For the record, I tried this, and it worked. On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat walrusthe...@gmail.com wrote: Oh so if I had something more reasonable, like RDD's full of tuples of say, (Int,Set,Set), I could expect a more uniform distribution? Thanks On Mon, Mar 24, 2014

Re: Announcing Spark SQL

2014-03-26 Thread Matei Zaharia
Congrats Michael & co for putting this together — this is probably the neatest piece of technology added to Spark in the past few months, and it will greatly change what users can do as more data sources are added. Matei On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com

Re: Announcing Spark SQL

2014-03-26 Thread Christopher Nguyen
+1 Michael, Reynold et al. This is key to some of the things we're doing. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust mich...@databricks.com wrote: Hey Everyone, This already went out to the

Re: All pairs shortest paths?

2014-03-26 Thread Matei Zaharia
Yeah, if you’re just worried about statistics, maybe you can do sampling (do single-pair paths from 100 random nodes and you get an idea of what percentage of nodes have what distribution of neighbors in a given distance). Matei On Mar 26, 2014, at 5:55 PM, Ryan Compton compton.r...@gmail.com

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-26 Thread Scott Clasen
The web-ui shows 3 executors, the driver and one spark task on each worker. I do see that there were 8 successful tasks and the ninth failed like so... java.lang.Exception (java.lang.Exception: Could not compute split, block input-0-1395860790200 not found)

Re: Announcing Spark SQL

2014-03-26 Thread Soumya Simanta
Very nice. Any plans to make the SQL typesafe using something like Slick ( http://slick.typesafe.com/)? Thanks! On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote: Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to

Spark preferred compression format

2014-03-26 Thread Debasish Das
Hi, What's the splittable compression format that works with Spark right now? We are looking into bzip2 / lzo / gzip... gzip is not splittable, so it is not a good option. Within bzip2/lzo I am confused. Thanks. Deb

Re: Announcing Spark SQL

2014-03-26 Thread Michael Armbrust
Any plans to make the SQL typesafe using something like Slick ( http://slick.typesafe.com/) I would really like to do something like that, and maybe we will in a couple of months. However, in the near term, I think the top priorities are going to be performance and stability. Michael

Not getting it

2014-03-26 Thread lannyripple
Hi all, I've got something which I think should be straightforward, but it's not, so I'm not getting it. I have an 8-node Spark 0.9.0 cluster also running HDFS. Workers have 16g of memory and use 8 cores. In HDFS I have a CSV file of 110M lines with 9 columns (e.g., [key,a,b,c...]). I have another