My thought would be to key by the first item in each array, then take just
one array for each key. Something like the below:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
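For the sample input this keeps one array per distinct first element: one of Seq(1,2,3,4) / Seq(1,5,2,3) for key 1 (whichever the reduce encounters first), and Seq(2,3,4,5) for key 2.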
On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
I reopen this thread because I have encountered this problem again.
Here is my env:
scala 2.10.3
spark 0.9.0, standalone mode
shark 0.9.0, downloaded the source code and built it myself
hive hive-shark-0.11
I have copied hive-site.xml from my hadoop cluster; its hive version is
0.12. After copying it, I
This worked great. Thanks a lot
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Java-API-Serialization-Issue-tp1460p3178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Is there a way to see the resource usage of each spark-shell command — say time
taken and memory used?
I checked the WebUI of spark-shell and of the master and I don’t see any such
breakdown. I see the time taken in the INFO logs but nothing about memory
usage. It would also be nice to track
Hi Qingyang Li,
Shark 0.9.0 uses a patched version of Hive 0.11, so using the
configuration/metastore of Hive 0.12 could be incompatible.
May I ask why you are using hive-site.xml from a previous Hive version (to
use an existing metastore?). Otherwise, you might just leave hive-site.xml blank.
Possibly one of your executors is in the middle of a large stop-the-world
GC and doesn't respond to network traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc) that might help with debugging.
Andrew
On Mon, Mar
Hi,
I have been following Aniket's spork github repository.
https://github.com/aniket486/pig
I have done all the changes mentioned in the recently modified pig-spark file.
I am using:
hadoop 2.0.5 alpha
spark-0.8.1-incubating
mesos 0.16.0
##PIG variables
export
Hi Guys,
I think I already asked this question, but I don't remember if anyone
has answered me. I would like to change, in the print() function, the
number of words and their frequencies that are sent to the driver's
screen. The default value is 10.
Could anyone help me with this?
Best
You can extend DStream and override the print() method. Then you can create
your own DStream extending from this.
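If subclassing feels heavy, here is a minimal alternative sketch that gets the same effect with foreachRDD and take(n); printFirst and num are made-up names, not Spark API:

    import org.apache.spark.streaming.dstream.DStream

    // Print the first `num` elements of each batch instead of the default 10.
    def printFirst[T](stream: DStream[T], num: Int): Unit = {
      stream.foreachRDD { rdd =>
        println("-------------------------------------------")
        rdd.take(num).foreach(println)
      }
    }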
On Tue, Mar 25, 2014 at 6:07 PM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi Guys,
I think I already asked this question, but I don't remember if anyone
has answered
Hi,
I have an example where I use a tuple of (int, int) in Python as the key for an
RDD. When I do a reduceByKey(...), sometimes the tuples turn up with the
two ints reversed in order (which is problematic, as the ordering is part
of the key).
Here is an IPython notebook that has some code and
OK, forget about this question. It was a nasty one-character typo in my
own code (sorting by rating instead of by item at one point).
Best,
Friso
On Tue, Mar 25, 2014 at 1:53 PM, Friso van Vollenhoven
f.van.vollenho...@gmail.com wrote:
Hi,
I have an example where I use a tuple of (int,int) in
Hi, I'm running a benchmark that compares Mahout and SparkML. For now I
have the following results for k-means:
Number of iterations = 10, number of elements = 1000: Mahout time = 602, Spark time = 138
Number of iterations = 40, number of elements = 1000: Mahout time = 1917, Spark time = 330
Number of
Maybe with MEMORY_ONLY, Spark has to recompute the RDD several times because
the partitions don't fit in memory. That makes things run slower.
As a general safe rule, use MEMORY_AND_DISK_SER.
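A minimal sketch of that suggestion; the input path is made up:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///path/to/input")              // hypothetical input
    val cached = data.persist(StorageLevel.MEMORY_AND_DISK_SER)  // partitions that don't fit in memory are spilled to disk, serialized
    cached.count()                                               // first action materializes the cache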
Guillaume Pitel - President of eXenSa
Prashant Sharma scrapco...@gmail.com wrote:
I think Mahout uses
Mahout does have a kmeans which can be executed in MapReduce and iterative
modes.
Sent from my iPhone
On Mar 25, 2014, at 9:25 AM, Prashant Sharma scrapco...@gmail.com wrote:
I think Mahout uses FuzzyKmeans, which is a different algorithm and is not
iterative.
Prashant Sharma
On
Mahout used MR and ran one MR job per iteration. It worked as predicted.
My question is more about why Spark was so slow. I will try
MEMORY_AND_DISK_SER.
2014-03-25 17:58 GMT+04:00 Suneel Marthi suneel_mar...@yahoo.com:
Mahout does have a kmeans which can be executed in MapReduce and iterative
I think Mahout uses FuzzyKmeans, which is a different algorithm and is not
iterative.
Prashant Sharma
On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov pahomov.e...@gmail.comwrote:
Hi, I'm running a benchmark that compares Mahout and SparkML. For now I
have the following results for k-means:
Number of
Let me rephrase that,
Do you think it is possible to use an accumulator to skip the first few
incomplete RDDs?
-Original Message-
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-25-14 9:57 AM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: RE:
Hi
Is there a way in Spark to run a function on each executor just once? I have
a couple of use cases.
a) I use an external library that is a singleton. It keeps some global state
and provides some functions to manipulate it (e.g. reclaim memory, etc.). I
want to check the global state of this
Deenar, when you say just once, have you defined just once across what
(e.g., across multiple threads in the same JVM on the same machine)? In
principle you can have multiple executors on the same machine.
In any case, assuming it's the same JVM, have you considered using a
singleton that
Hi,
I'm trying to save an RDD to HDFS with the saveAsTextFile method on my ec2
cluster and am encountering the following exception (the app is called
GraphTest):
Exception failure: java.lang.ClassCastException: cannot assign instance of
GraphTest$$anonfun$3 to field
After digging deeper, I realized all the workers ran out of memory, giving
an hs_error.log file in /tmp/jvm-PID with the header:
# Native memory allocation (malloc) failed to allocate 2097152 bytes for
committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap
Deenar, the singleton pattern I'm suggesting would look something like this:
public class TaskNonce {
  private transient boolean mIsAlreadyDone;
  private static transient TaskNonce mSingleton = new TaskNonce();
  private transient Object mSyncObject = new Object();
  public static TaskNonce get() { return mSingleton; }
  // sketch of the remainder of the snippet: run the given task at most once per JVM
  public void doThisOnce(Runnable task) {
    synchronized (mSyncObject) {
      if (!mIsAlreadyDone) { task.run(); mIsAlreadyDone = true; }
    }
  }
}
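A hypothetical usage sketch from inside a Spark job, assuming the TaskNonce class above; ExternalLib is a made-up stand-in for the singleton library:

    rdd.mapPartitions { iter =>
      TaskNonce.get().doThisOnce(new Runnable {
        def run(): Unit = ExternalLib.reclaimMemory()  // ExternalLib is hypothetical
      })
      iter
    }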
Hi All,
I'm getting the following error when I execute start-master.sh which also
invokes spark-class at the end.
Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/
You need to build Spark with 'sbt/sbt assembly' before running this program.
After digging into the
The logs from the executor are redirected to stdout only because there is a
default log4j.properties that is configured to do so. If you put your own
log4j.properties with a rolling file appender on the classpath (refer to the
Spark docs for that), all the logs will get redirected to separate files
that
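A minimal log4j.properties sketch along those lines (log4j 1.x, which Spark ships with); the file path and sizes are illustrative:

    log4j.rootLogger=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=/var/log/spark/executor.log
    log4j.appender.rolling.MaxFileSize=50MB
    log4j.appender.rolling.MaxBackupIndex=5
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n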
Starting with Spark 0.9 the protobuf dependency we use is shaded and
cannot interfere with other protobuf libraries, including those in
Hadoop. Not sure what's going on in this case. Would someone who is
having this problem post exactly how they are building spark?
- Patrick
On Fri, Mar 21, 2014
Hi Paul,
I got it sorted out.
The problem is that the JARs are built into the assembly JARs when you run
sbt/sbt clean assembly
What I did instead is: sbt/sbt clean package
This will only give you the small JARs. The next step is to update the
CLASSPATH in the bin/compute-classpath.sh script manually,
You can try adding the following to your shell:
In bin/compute-classpath.sh, append the lzo JAR from MapReduce:
CLASSPATH=$CLASSPATH:$HADOOP_HOME/share/hadoop/mapreduce/lib/hadoop-lzo.jar
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native/
export
As promised, here is that followup post for those looking to get started
with Shark against Cassandra:
http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html
Again -- thanks to Rohit and the team at TupleJump. Great work.
-brian
--
Brian ONeill
CTO, Health Market
Unfortunately there isn't one right now. But it probably isn't too hard to
start with the JavaNetworkWordCount example
(https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java)
and use the ZeroMQUtils in the same way as the
Two good benefits of Streaming:
1. It maintains windows as you move across time, removing/adding monads as
you move through the window.
2. It connects with streaming systems like Kafka to import data as it comes
and process it.
You don't seem to need any of these features, so you would be better off using
Spark
You can probably do it in a simpler but sort of hacky way!
If your window size is W and the sliding interval is S, you can do some math to
figure out how many of the first windows are actually partial windows. It's
probably math.ceil(W/S). So in a windowDStream.foreachRDD() you can
increment a global
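A rough sketch of that hack, assuming W and S are expressed in batches, windowDStream is the windowed stream, and process is a placeholder for the real per-window work:

    val partialWindows = math.ceil(W.toDouble / S).toInt  // leading windows that are still incomplete
    var windowCount = 0                                   // foreachRDD closures run on the driver, so a plain var works
    windowDStream.foreachRDD { rdd =>
      if (windowCount >= partialWindows) {
        process(rdd)  // only complete windows reach here
      }
      windowCount += 1
    }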
Well, my long-running app has 512M per executor on a 16-node cluster
where each machine has 16G of RAM. I could not run a second application
until I restricted spark.cores.max. As soon as I restricted the
cores, I was able to run a second job at the same time.
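For reference, a one-line sketch of capping an application's cores (the app name and the value 8 are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("app-one").set("spark.cores.max", "8")  // cap this app at 8 cores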
Ognen
On 3/24/14, 7:46 PM,
Hi Folks,
Is this issue resolved? If yes, could you please throw some light on how to
fix it?
I am facing the same problem when writing to text files.
When I do
stream.foreachRDD(rdd => {
  rdd.saveAsTextFile("some path")
})
This works