Hmm, interesting. I created
https://issues.apache.org/jira/browse/SPARK-1499 to track the issue of
Workers continuously spewing bad executors, but the
real issue seems to be a combination of that and some other bug in Shark or
Spark which fails to handle the situation properly.
Please let us know
Hey, I was talking about something more like:
val size = 1024 * 1024
val numSlices = 8
val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) }
val rdd = sc.parallelize(arr, numSlices).cache()
val size2 = rdd.map(_.length).sum()
assert( size2 ==
I want to apply the following transformations to 60 GB of data on 7 nodes with
10 GB of memory. And I am wondering whether the groupByKey() function returns an RDD
with a single partition for each key? If so, what will happen if the size of
that partition doesn't fit into that particular node?
rdd =
Hi Debasish,
I found that PageRank on LiveJournal took less than 100 seconds for GraphX on your
EC2 setup. But when I ran the example (LiveJournalPageRank) you provided on my
machines with the same LiveJournal dataset, it took more than 10 minutes.
Following are some details:
Environment: 8 machines with each
The solution:
Edit /opt/spark-0.9.0-incubating-bin-hadoop2/conf/log4j.properties, changing
Spark's output to WARN. Done!
Refer to:
https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/src/src/main/resources/log4j.properties#L8
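For reference, the line in question looks something like this, assuming the stock Spark log4j template (only the level changes from INFO to WARN):

# Set everything to be logged to the console at WARN instead of INFO
log4j.rootCategory=WARN, console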
eduardocalfaia wrote
Have you already tried in
Hi all,
My Spark program always gives me the error java.lang.OutOfMemoryError: Java
heap space in my standalone cluster. Here is my code:
object SimCalcuTotal {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://192.168.2.184:7077", "Sim Calcu Total",
Hi all,
I am evaluating Spark to use here at my work.
We have an existing Hadoop 1.x install which I am planning to upgrade to Hadoop
2.3.
I am trying to work out whether I should install YARN or simply set up a
Spark standalone cluster. We already use ZooKeeper so it isn't a problem to
setup
Prashant,
In another email thread several weeks ago, it was mentioned that YARN
support is considered beta until Spark 1.0. Is that not the case?
-Suren
On Tue, Apr 15, 2014 at 8:38 AM, Prashant Sharma scrapco...@gmail.com wrote:
Hi Ishaaq,
Answers inline from what I know; I'd like to be
This would be super useful. Thanks.
On 4/15/14, 1:30 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi Andrew,
I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on
ML algorithms, as I'm particularly curious about the relative performance
of
MLlib in Scala vs the Python
I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
Given the size, and that it is a single file, I assumed it would only be in
a single partition. But when I cache it, I can see in the Spark App UI
that it actually splits it into two partitions:
[image: Inline image 1]
Is this
Thanks Aaron, this is useful!
- Manoj
On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson ilike...@gmail.com wrote:
Launching drivers inside the cluster was a feature added in 0.9, for
standalone cluster mode:
Ah, I think I can see where your issue may be coming from. In spark-shell,
the MASTER is local[*], which just means it uses a pre-set number of
cores. This distinction only matters because the default number of slices
created from sc.parallelize() is based on the number of cores.
So when you run
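A quick way to see this in spark-shell (a sketch; defaultParallelism and partitions are the standard SparkContext/RDD members):

// The default number of slices for parallelize follows the scheduler's parallelism,
// which for local[*] is the number of cores on the machine.
val rdd = sc.parallelize(1 to 1000)
println(sc.defaultParallelism)   // e.g. 8 on an 8-core box
println(rdd.partitions.length)   // same value, unless numSlices is passed explicitly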
Take a look at the minSplits argument for SparkContext#textFile [1] -- the
default value is 2. You can simply set this to 1 if you'd prefer not to
split your data.
[1]
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
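For example (the path is made up; the second argument is the minSplits parameter mentioned above):

// Ask for a single split so the tiny file stays in one partition.
val lines = sc.textFile("/path/to/tiny-file.txt", 1)
println(lines.partitions.length)   // 1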
On Tue, Apr 15, 2014 at 8:44 AM, Diana
Hi all,
As in a previous thread, I am asking how to implement a divide-and-conquer
algorithm (skyline) in Spark.
Here is my current solution:
val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble))
val result = data.mapPartitions(points =>
Hi All,
I am desperately looking for some help.
My cluster has 6 nodes, each with a dual-core CPU and 8 GB of RAM. The Spark version
running on the cluster is spark-0.9.0-incubating-bin-cdh4.
I am getting OutOfMemoryError when running a Spark Streaming job
(non-streaming version works fine) which queries
Yup, one reason it’s 2 actually is to give people a similar experience to
working with large files, in case their code doesn’t deal well with the file
being partitioned.
Matei
On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote:
Take a look at the minSplits argument for
Hi Folks,
I have some questions about how Spark scheduler works:
- How does Spark know how many resources a job might need?
- How does it fairly share resources between multiple jobs?
- Does it know about data and partition sizes and use that information
for scheduling?
Mohit.
Your Spark solution first reduces partial results into a single partition,
computes the final result, and then collects to the driver side. This
involves a shuffle and two waves of network traffic. Instead, you can
directly collect partial results to the driver and then compute the final
results
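A rough sketch of that alternative, assuming skyline is the poster's own function that takes and returns a collection of points:

// Compute a local skyline inside each partition, then merge once on the driver.
val partial = data
  .mapPartitions(points => Iterator(skyline(points.toArray)))
  .collect()                       // one partial skyline per partition
val result = skyline(partial.flatten)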
Andrew,
Thank you very much for your feedback. Unfortunately, the ranges are not
of predictable size but you gave me an idea of how to handle it. Here's
what I'm thinking:
1. Choose number of partitions, n, over IP space
2. Preprocess the IPRanges, splitting any of them that cross partition
I've received the same error with Spark built using Maven. It turns out that
mesos-0.13.0 depends on protobuf-2.4.1 which is causing the clash at
runtime. Protobuf included by Akka is shaded and doesn't cause any problems.
The solution is to update the mesos dependency to 0.18.0 in spark's
I'm thinking of creating a union type for the key so that IPRange and IP
types can be joined.
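One way to read that idea as a sketch (all names here are hypothetical; the actual types aren't shown in the thread):

// A sealed trait lets both kinds of key live in one key type,
// so both RDDs can be keyed as RDD[(IpKey, V)] and then joined or cogrouped.
sealed trait IpKey
case class SingleIp(ip: Long) extends IpKey
case class IpRange(start: Long, end: Long) extends IpKey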
On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover roger.hoo...@gmail.com wrote:
Andrew,
Thank you very much for your feedback. Unfortunately, the ranges are not
of predictable size but you gave me an
Actually altering the classpath in the REPL causes the provided
SparkContext to disappear:
scala> sc.parallelize(List(1,2,3))
res0: spark.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:13
scala> :cp /root
Added '/root'. Your new classpath is:
Hello,
Currently I deployed 0.9.1 spark using a new way of starting up spark
exec start-stop-daemon --start --pidfile /var/run/spark.pid
--make-pidfile --chuid ${SPARK_USER}:${SPARK_GROUP} --chdir ${SPARK_HOME}
--exec /usr/bin/java -- -cp ${CLASSPATH}
It depends on your algorithm but I guess that you probably should use
reduce (the code probably doesn't compile but it shows you the idea).
val result = data.reduce { case (left, right) =>
skyline(left ++ right)
}
Or, in case you want to merge the result of one partition with another, you
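For instance, something along these lines (a sketch only; skyline is the poster's own function from the original message):

// Local skyline per partition, then a pairwise merge of the partial results.
val result = data
  .mapPartitions(points => Iterator(skyline(points.toArray)))
  .reduce { case (left, right) => skyline(left ++ right) }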
This is probably related to the Scala bug that :cp does not work:
https://issues.scala-lang.org/browse/SI-6502
On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat walrusthe...@gmail.com wrote:
Actually altering the classpath in the REPL causes the provided
SparkContext to disappear:
scala
Thank you very much!
On Tue, Apr 15, 2014 at 11:29 AM, Aaron Davidson ilike...@gmail.com wrote:
This is probably related to the Scala bug that :cp does not work:
https://issues.scala-lang.org/browse/SI-6502
On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat walrusthe...@gmail.com wrote:
Actually
Hi,
after starting the shark-shell
via /opt/shark/shark-0.9.1/bin/shark-withinfo -skipRddReload I receive lots
of output, including the exception that /bin/java cannot be executed. But
it is linked to /usr/bin/java?!?!
root# ls -al /bin/java
lrwxrwxrwx 1 root root 13 15. Apr 21:45 /bin/java
Hi,
I have a problem when I want to use the Spark Kryo serializer: I extend the
KryoRegistrator class to register custom classes in order to create objects. I
am getting the following exception when I run the following program. Please let me
know what could be the problem...
] (run-main)
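For comparison, a minimal registrator setup usually looks like the sketch below (the record and class names are made up, since the actual program isn't shown):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical custom class to be serialized with Kryo
case class MyRecord(id: Int, name: String)

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord])
  }
}

// Enable Kryo and point Spark at the registrator before creating the SparkContext
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")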
Hi,
I am testing ALS using 7 nodes. Each node has 4 cores and 8 GB of memory. The ALS
program cannot run even with a very small training set (about 91
lines) due to a StackOverflowError when I set the number of iterations to
100. I think the problem may be caused by the updateFeatures method, which
Looking at the Python version of textFile()
(http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile),
shouldn't it be *max*(self.defaultParallelism, 2)?
If the default parallelism is, say, 4, wouldn't we want to use that for
minSplits instead of 2?
On Tue,
What is the support for multi-tenancy in Spark?
I assume more than one driver can share the same cluster, but can a driver run
two jobs in parallel?
I am getting a java.net.SocketException: Network is unreachable whenever I
do a count on one of my tables.
If I just do a take(1), I see the task status as killed on the master UI,
but I get back the results.
My driver runs on my local system which is accessible over the public
internet and
I am a dork, please disregard this issue. I did not have the slaves
correctly configured. This error is very misleading.
On Tue, Apr 15, 2014 at 11:21 AM, Paul Schooss paulmscho...@gmail.com wrote:
Hello,
Currently I deployed 0.9.1 spark using a new way of starting up spark
exec
Yes, both things can happen. Take a look at
http://spark.apache.org/docs/latest/job-scheduling.html, which includes
scheduling concurrent jobs within the same driver.
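A minimal sketch of the second case, two jobs submitted concurrently from one driver (the RDD and the actions here are arbitrary; FAIR mode is optional but is what shares the cluster evenly between them):

val shared = sc.parallelize(1 to 1000000).cache()

// Each thread submits its own job; with spark.scheduler.mode=FAIR they
// get roughly equal shares of the cluster instead of running FIFO.
val t1 = new Thread(new Runnable {
  def run() { println("count = " + shared.count()) }
})
val t2 = new Thread(new Runnable {
  def run() { println("sum = " + shared.map(_.toLong).reduce(_ + _)) }
})
t1.start(); t2.start()
t1.join(); t2.join()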
Matei
On Apr 15, 2014, at 4:08 PM, Ian Ferreira ianferre...@hotmail.com wrote:
What is the support for multi-tenancy in
Thanks Matei!
Sent from my Windows Phone
From: Matei Zaharia <matei.zaha...@gmail.com>
Sent: 4/15/2014 7:14 PM
To: user@spark.apache.org
Subject: Re: Multi-tenant?
Yes, both things can happen. Take a look at
Hi Prasad
Sorry for missing your reply.
https://gist.github.com/thegiive/10791823
Here it is.
Wisely Chen
On Fri, Apr 4, 2014 at 11:57 PM, Prasad ramachandran.pra...@gmail.com wrote:
Hi Wisely,
Could you please post your pom.xml here.
Thanks
Thank you so much, Davidson.
Yes, you are right: in both sbt and the Spark shell, the result of my code is
28 MB; it's irrelevant to numSlices.
Yesterday I got a result of 4.2 MB in the Spark shell, because I removed the array
initialization for laziness :)
for (i <- 0 until size) {
  array(i) = i
}
Probably this JIRA issue
(https://spark-project.atlassian.net/browse/SPARK-1006) solves
your problem. When running with a large iteration count, the lineage
DAG of ALS becomes very deep, and both the DAGScheduler and the Java serializer may
overflow because they are implemented in a recursive way. You may resort
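One general way to keep such a deep lineage in check is periodic checkpointing; a sketch (initialFeatures and numIterations are placeholders, updateFeatures stands for the poster's own update step, and this is a general technique rather than necessarily the exact fix discussed in the JIRA):

sc.setCheckpointDir("/tmp/spark-checkpoints")    // hypothetical checkpoint location

var features = initialFeatures                   // some RDD rebuilt each iteration
for (i <- 1 to numIterations) {
  features = updateFeatures(features).cache()
  if (i % 10 == 0) {
    features.checkpoint()                        // truncate the lineage periodically
    features.count()                             // force materialization of the checkpoint
  }
}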
Has anyone got this working? I have enabled the properties for it in the
metrics.conf file and ensured that it is placed under Spark's home
directory. Any ideas why I don't see Spark beans?
Home directory or $home/conf directory? It works for me with
metrics.properties hosted under the conf dir.
On Tue, Apr 15, 2014 at 6:08 PM, Paul Schooss paulmscho...@gmail.com wrote:
Has anyone got this working? I have enabled the properties for it in the
metrics.conf file and ensure that it is
In the worker logs I can see:
14/04/16 01:02:47 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@xx:10548] -> [akka.tcp://sparkExecutor@xx:16041]:
Error [Association failed with [akka.tcp://sparkExecutor@xx:16041]] [
akka.remote.EndpointAssociationException: Association
Thanks a lot for your information. It really helps me.
On Tue, Apr 15, 2014 at 7:57 PM, Cheng Lian lian.cs@gmail.com wrote:
Probably this JIRA issue
(https://spark-project.atlassian.net/browse/SPARK-1006) solves your
problem. When running with large iteration number, the lineage
DAG of
Eugen,
Thanks for your tip. I do want to merge the result of one partition with
another, but I am still not quite clear how to do it.
Say the original data RDD has 32 partitions; since mapPartitions won't
change the number of partitions, there will still be 32 partitions, each of which
contains
I was wondering if groupByKey returns 2 partitions in the example below?
x = sc.parallelize([('a', 1), ('b', 1), ('a', 1)])
sorted(x.groupByKey().collect())
[('a', [1, 1]), ('b', [1])]
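For what it's worth, a quick check of the same question from the Scala shell (a sketch; by default groupByKey keeps the parent's partition count unless spark.default.parallelism is set or a numPartitions argument is passed):

val x = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)
val grouped = x.groupByKey()
println(grouped.partitions.length)   // 2 here -- same as the parent RDD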