Re: java.io.NotSerializableException

2014-02-24 Thread leosand...@gmail.com
Which class is not Serializable? I am running Shark 0.9 and hit a similar exception: java.io.NotSerializableException (java.io.NotSerializableException: shark.execution.ReduceKeyReduceSide) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)

Shark server crashes - [Thrift Error]: java.net.SocketException: Socket closed

2014-02-24 Thread Arpit Tak
The Shark server crashes after some time when running big queries... Any suggestions on how to get rid of this? Running show tables then gives: [Thrift Error]: java.net.SocketException: Socket closed [Thrift Error]: Hive server is not cleaned due to thrift exception: java.net.SocketException: Socket closed

java.lang.ClassNotFoundException

2014-02-24 Thread Terance Dias
Hi, I'm trying the Spark on YARN example at https://spark.incubator.apache.org/docs/latest/running-on-yarn.html When I try to run the SparkPi example using the spark-class command, the job fails, and in the stderr file of the job logs I see the following error: java.lang.ClassNotFoundException:

Nothing happens when executing on cluster

2014-02-24 Thread Anders Bennehag
Hello there, I'm having some trouble with my Spark cluster consisting of master.censored.dev and spark-worker-0. Reading the output of pyspark, the master, and the worker node, it seems like the cluster is formed correctly and pyspark connects to it. But for some reason, nothing happens after

Re: Creating a Spark context from a Scalatra servlet

2014-02-24 Thread Ognen Duzlevski
In any case, I am running the same version of standalone Spark on the cluster as the jobserver (I compiled the master branch as opposed to the jobserver branch, not sure if this matters). I then changed the application.conf file to point to spark://master_ip:7077 as the master.

Re: Creating a Spark context from a Scalatra servlet

2014-02-24 Thread Ognen Duzlevski
Figured it out. I did sbt/sbt assembly on the same jobserver branch and am running that as a standalone spark cluster. I am then running a separate jobserver from the same branch - it all works now. Ognen

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Hi Evan, Thanks for the quick response. The only mapping between UUIDs and Longs that I can think of is one where I sequentially assign Longs as I load the UUIDs from the DB. But this results in having to centralize this mapping. I am guessing that centralizing this is not a good idea for a
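
A minimal plain-Python sketch of the sequential-assignment approach described above; the dictionary lives in one place (for example on the driver), which is exactly the centralization concern being raised:

    import uuid

    # Central dictionary mapping each UUID to the next sequential id.
    uuid_to_seq = {}

    def seq_id(u):
        # Assigns 0, 1, 2, ... in load order; every lookup goes through this
        # single shared dict, so the mapping cannot be built in parallel.
        return uuid_to_seq.setdefault(u, len(uuid_to_seq))

    ids = [seq_id(uuid.uuid4()) for _ in range(3)]   # [0, 1, 2]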

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Hi Josh, Thanks for your quick response. Yes, it is a practical option, but my concern is the need to centralize this mapping. Please see my reply to Evan's response. Thanks. -deepak

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Ewen Cheslack-Postava
You can almost certainly take half of the UUID safely, assuming you're using random UUIDs. You could work out the math if you're really concerned, but the probability of a collision in 64 bits is probably pretty low even with a very large data set. If your UUIDs aren't version 4, you probably
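
A small Python sketch of the truncation Ewen describes, assuming version 4 (random) UUIDs; the top 64 bits of the UUID are kept and shifted into the signed range of a JVM Long, the type GraphX uses for vertex IDs:

    import uuid

    def uuid_to_long(u):
        # Keep the high 64 bits of the 128-bit UUID, then map the result
        # into the signed 64-bit range of a JVM Long.
        high = u.int >> 64
        return high - 2**64 if high >= 2**63 else high

    u = uuid.uuid4()
    print(uuid_to_long(u))   # a signed 64-bit id built from the random high bits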

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Ewen Cheslack-Postava
In addition, you can easily verify there are no collisions with Spark before running anything through GraphX -- create the mapping and then groupByKey to find any keys with multiple mappings. Ewen
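
A hedged PySpark sketch of that check, reusing the uuid_to_long helper from the previous sketch; uuid_strings stands in for an RDD of UUID strings and sc for an existing SparkContext:

    import uuid

    # uuid_strings: an RDD of UUID strings, e.g. sc.textFile("hdfs:///path/uuids")
    mapping = uuid_strings.map(lambda s: (uuid_to_long(uuid.UUID(s)), s))

    # A long id that maps to more than one distinct UUID is a collision.
    collisions = mapping.groupByKey().filter(lambda kv: len(set(kv[1])) > 1)
    print(collisions.count())   # 0 means the truncated ids are safe to use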

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Thanks Ewen, I will look into using half the UUID (we are indeed using random (version 4) UUIDs).

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Thanks Christopher, I will look into the StackOverflow suggestion of generating 64-bit UUIDs in the same fashion as 128-bit UUIDs.

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
A lot of great suggestions here that I am going to investigate. In parallel, I would like to explore the possibility of having GraphX be parameterized on the VertexId type. Is that a question for the developer mailing list? Thanks. -deepak

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Christopher Nguyen
Deepak, to be sure, I was referring to sequential guarantees with the longs. I would suggest being careful with taking half the UUID, as the probability of collision can be unexpectedly high. Many bits of the UUID are typically time-based, so collision among those bits is virtually guaranteed with

cached rdd in memory eviction

2014-02-24 Thread Koert Kuipers
I was under the impression that running jobs could not evict cached RDDs from memory as long as they stay below spark.storage.memoryFraction. However, what I observe seems to indicate the opposite. Did anything change? Thanks! Koert
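
For reference, a minimal PySpark sketch of the setting in question (0.9-era API, illustrative value); whether running jobs can still evict blocks below this fraction is exactly what is being asked here:

    from pyspark import SparkConf, SparkContext

    # Reserve up to 60% of executor heap for cached RDD blocks (illustrative value).
    conf = (SparkConf()
            .setAppName("cache-demo")
            .set("spark.storage.memoryFraction", "0.6"))
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(1000000)).cache()   # MEMORY_ONLY storage
    rdd.count()   # materializes the cached blocks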

Re: ETL on pyspark

2014-02-24 Thread Matei Zaharia
collect() means to bring all the data back to the master node, and there might just be too much of it for that. How big is your file? If you can’t bring it back to the master node, try saveAsTextFile to write it out to a filesystem (in parallel). Matei
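
A one-line sketch of that suggestion, with a placeholder RDD name and output path:

    # Writes one part-NNNNN file per partition, in parallel, instead of
    # pulling everything back to the driver with collect().
    rdd.saveAsTextFile("hdfs:///tmp/output")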

Re: ETL on pyspark

2014-02-24 Thread Chengi Liu
It's around 10 GB. All I want is to do a frequency count and then get the top 10 entries by count. How do I do this (again on pyspark)? Thanks

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Deepak Nulu
Thanks Christopher. I too am not comfortable with halving the random UUIDs, and thanks to your response, I don't need to do the math :-). The StackOverflow link you suggested had a different set of ideas that I am a bit more comfortable with, but the one I am still hoping for is the use of UUIDs as

Re: ETL on pyspark

2014-02-24 Thread Chengi Liu
Hi, I'm using pyspark for the first time on a realistic dataset (a few hundred GBs) and have been seeing a lot of errors in the pyspark shell. This might be because I am not using pyspark correctly. But here is what I was trying: extract_subs.take(2) # returns [u'867430', u'867429']

Re: ETL on pyspark

2014-02-24 Thread Matei Zaharia
Yeah, so the problem is that countByValue returns *all* values and their counts to your machine. If you just want the top 10, try this:

    # do a distributed count using reduceByKey
    counts = data.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
    # reverse the (key, count) pairs into (count,
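
The quoted reply is truncated here; a complete sketch of the same approach follows (API names as in current PySpark), with data standing in for the RDD being counted:

    # distributed frequency count
    counts = data.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

    # reverse (key, count) into (count, key), sort by count descending,
    # and bring only the first 10 pairs back to the driver
    top10 = (counts.map(lambda kv: (kv[1], kv[0]))
                   .sortByKey(ascending=False)
                   .take(10))
    print(top10)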

apparently non-critical errors running spark-ec2 launch

2014-02-24 Thread nicholas.chammas
I'm seeing a bunch of (apparently) non-critical errors when launching new clusters with spark-ec2 0.9.0. Here are some of them (emphasis added; names redacted):

    Generating cluster's SSH key on master...
    ssh: connect to host ec2-redacted.compute-1.amazonaws.com port 22: Connection refused

How to get well-distributed partitions

2014-02-24 Thread zhaoxw12
I use spark-0.8.0. This is my code in python:

    list = [(i, i*i) for i in xrange(0, 16)] * 10
    rdd = sc.parallelize(list, 80)
    temp = rdd.collect()
    temp2 = rdd.partitionBy(16, lambda x: x)
    count = 0
    for i in temp2.glom().collect():
        print count, "**", i
        count += 1

This will get the result below: 0 **
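
A shorter way to inspect how the keys land, assuming the same sc and data; each number is the element count of one partition, so a well-distributed result would show 16 roughly equal counts:

    data = [(i, i * i) for i in xrange(0, 16)] * 10
    rdd = sc.parallelize(data, 80)

    # partitionBy applies the partition function to each key and takes the result
    # modulo the number of partitions, so lambda k: k should send key i to partition i.
    sizes = rdd.partitionBy(16, lambda k: k).glom().map(len).collect()
    print(sizes)   # ideally 16 entries of 10 each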