Which class is not Serializable?
I ran Shark 0.9 and hit a similar exception:
java.io.NotSerializableException (java.io.NotSerializableException:
shark.execution.ReduceKeyReduceSide)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
The Shark server crashes after some time when running big queries ...
Any suggestions on how to get rid of it?
Running show tables gives an exception:
[Thrift Error]: java.net.SocketException: Socket closed
[Thrift Error]: Hive server is not cleaned due to thrift exception:
java.net.SocketException: Socket closed
Hi,
I'm trying the Spark on YARN example at
https://spark.incubator.apache.org/docs/latest/running-on-yarn.html
When I try to run the SparkPi example using the spark-class command, the
job fails, and in the stderr file of the job logs I see the following error:
java.lang.ClassNotFoundException:
Hello there,
I'm having some trouble with my Spark cluster consisting of
master.censored.dev and
spark-worker-0.
From the output of pyspark, the master, and the worker node, it seems like
the cluster is formed correctly and pyspark connects to it. But for some
reason, nothing happens after
In any case,
I am running the same version of Spark standalone on the cluster as the
jobserver (I compiled the master branch as opposed to the jobserver
branch; not sure if this matters). I then changed the
application.conf file to point to spark://master_ip:7077 as the master.
Figured it out. I did sbt/sbt assembly on the same jobserver branch and
am running that as a standalone spark cluster. I am then running a
separate jobserver from the same branch - it all works now.
Ognen
Hi Evan,
Thanks for the quick response. The only mapping between UUIDs and Longs that
I can think of is one where I sequentially assign Longs as I load the UUIDs
from the DB. But this results in having to centralize this mapping. I am
guessing that centralizing this is not a good idea for a
Hi Josh,
Thanks for your quick response. Yes, it is a practical option, but my
concern is the need to centralize this mapping. Please see my reply to
Evan's response.
Thanks.
-deepak
You can almost certainly
take half of the UUID safely, assuming you're using random UUIDs. You
could work out the math if you're really concerned, but the probability
of a collision in 64 bits is probably pretty low even with a very large
data set. If your UUIDs aren't version 4, you probably
In addition, you can
easily verify there are no collisions with Spark before running anything
through GraphX -- create the mapping and then groupByKey to find any
keys with multiple mappings.
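A minimal pyspark sketch of that check (the local master, the sample input, and the name uuid_rdd are all illustrative assumptions; assumes random version-4 UUID strings):

from pyspark import SparkContext
import uuid

sc = SparkContext("local", "uuid-collision-check")
# hypothetical input: an RDD of version-4 UUID strings loaded elsewhere
uuid_rdd = sc.parallelize([str(uuid.uuid4()) for _ in xrange(100000)])
# candidate vertex ID: the low 64 bits of each UUID
pairs = uuid_rdd.map(lambda u: (uuid.UUID(u).int & ((1 << 64) - 1), u))
# group by the 64-bit key; any key with more than one UUID is a collision
collisions = pairs.groupByKey().filter(lambda kv: len(list(kv[1])) > 1)
print collisions.count()  # 0 means the truncated IDs are safe to use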
Ewen Cheslack-Postava
Thanks Ewen, I will look into using half the UUID (we are indeed using random
(version 4) UUIDs).
Thanks Christopher, I will look into the StackOverflow suggestion of
generating 64-bit UUIDs in the same fashion as 128-bit UUIDs.
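For reference, a minimal sketch of that idea (purely illustrative; not from the thread):

import random

def uuid64():
    # 64 random bits, analogous to the 122 random bits of a version-4 UUID
    return random.getrandbits(64)

print uuid64()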
A lot of great suggestions here that I am going to investigate. In parallel,
I would like to explore the possibility of having GraphX be parameterized on
the VertexId type. Is that a question for the developer mailing list?
Thanks.
-deepak
Deepak, to be sure, I was referring to sequential guarantees with the longs.
I would suggest being careful with taking half the UUID as the probability
of collision can be unexpectedly high. Many bits of a UUID are typically
time-based, so collision among those bits is virtually guaranteed with
I was under the impression that running jobs could not evict cached RDDs
from memory as long as they are below spark.storage.memoryFraction. However,
what I observe seems to indicate the opposite. Did anything change?
thanks! koert
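For context, here is how that fraction is set in a pyspark program (a sketch with illustrative values; the master URL and input path are assumptions, and 0.6 is the default):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://master_ip:7077")
        .set("spark.storage.memoryFraction", "0.6"))
sc = SparkContext(conf=conf)
rdd = sc.textFile("hdfs:///path/to/data")  # hypothetical path
rdd.cache()  # cached blocks are meant to stay within the fraction above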
collect() means to bring all the data back to the master node, and there might
just be too much of it for that. How big is your file? If you can’t bring it
back to the master node try saveAsTextFile to write it out to a filesystem (in
parallel).
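A minimal sketch of that approach (paths are hypothetical):

data = sc.textFile("hdfs:///path/to/input")
# writes one part-NNNNN file per partition, in parallel, instead of
# funneling everything back through the driver like collect() does
data.saveAsTextFile("hdfs:///path/to/output")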
Matei
On Feb 24, 2014, at 1:08 PM, Chengi Liu
It's around 10 GB. All I want is to do a frequency count, and then get the
top 10 entries based on count. How do I do this (again, in pyspark)?
Thanks
Thanks Christopher. I too am not comfortable with halving the random UUIDs,
and thanks to your response, I don't need to do the math :-). The
StackOverflow link you suggested had a different set of ideas that I am a bit
more comfortable with, but the one I am still hoping for is the use of UUIDs
as
Hi,
Using pyspark for the first time on a realistic dataset (a few hundred
GBs), but I have been seeing a lot of errors in the pyspark shell. This
might be because I am not using pyspark correctly.
But here is what I was trying:
extract_subs.take(2)
# returns [u'867430', u'867429']
Yeah, so the problem is that countByValue returns *all* values and their counts
to your machine. If you just want the top 10, try this:
# do a distributed count using reduceByKey
counts = data.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
# reverse the (key, count) pairs into (count, key), sort descending, take 10
top10 = counts.map(lambda kv: (kv[1], kv[0])).sortByKey(False).take(10)
I'm seeing a bunch of (apparently) non-critical errors when launching new
clusters with spark-ec2 0.9.0.
Here are some of them (emphasis added; names redacted):
Generating cluster's SSH key on master...
ssh: connect to host ec2-redacted.compute-1.amazonaws.com port 22:
Connection refused
I use Spark 0.8.0. This is my code in Python:
list = [(i, i*i) for i in xrange(0, 16)]*10
rdd = sc.parallelize(list, 80)
temp = rdd.collect()
temp2 = rdd.partitionBy(16, lambda x: x )
count = 0
for i in temp2.glom().collect():
    print count, "**", i
    count += 1
This gives the result below:
0 **