i was under the impression that running jobs could not evict cached rdds
from memory as long as they stay below spark.storage.memoryFraction. however,
what i observe seems to indicate the opposite. did anything change?
thanks! koert
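for reference, in this era of spark the storage fraction is a spark.* system property set on the workers; a minimal sketch, with an illustrative value (the exact default varied by version):

```shell
# sketch: fraction of executor heap reserved for cached rdd blocks
# (illustrative value; typically set in conf/spark-env.sh on the workers)
export SPARK_JAVA_OPTS="-Dspark.storage.memoryFraction=0.6"
```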
looks like it could be kryo related?
i am only guessing here, but you can configure kryo buffers separately...
see:
spark.kryoserializer.buffer.mb
On Mon, Feb 17, 2014 at 7:49 PM, agg wrote:
> Hi guys,
>
> I'm trying to run a basic version of kmeans (not the mllib version), on
> 250gb of data
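a sketch of the suggestion above: the kryo buffer is sized separately from the storage settings, so on large records it may need raising (the 64mb value is illustrative, not from the thread):

```shell
# sketch: size the kryo serialization buffer (in MB) separately from
# storage memory; a buffer too small for large serialized objects fails
export SPARK_JAVA_OPTS="-Dspark.kryoserializer.buffer.mb=64"
```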
it is possible to run multiple queries using a shared SparkContext (which
holds the shared RDD). however this is not easily available in spark-shell
i believe.
alternatively tachyon can be used to share (serialized) RDDs
On Mon, Feb 17, 2014 at 11:41 AM, David Thomas wrote:
> Is it possible fo
in the documentation for running spark on yarn it states:
"We do not requesting container resources based on the number of cores.
Thus the numbers of cores given via command line arguments cannot be
guaranteed."
can someone explain this a bit more?
is it simply a reflection of the fact that yarn
hedulerBackend.scala#L156>
>> .
>>
>> Simplest possibility is that you're setting spark.default.parallelism,
>> otherwise there may be a bug introduced somewhere that isn't defaulting
>> correctly anymore.
>>
>>
>> On Sat, Feb 1, 2014 at 12:30
i just managed to upgrade my 0.9-SNAPSHOT from the last scala 2.9.x version
to the latest.
everything seems good except that my default parallelism is now set to 2
for jobs instead of some smart number based on the number of cores (i think
that is what it used to do). is this change on purpose?
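as a workaround until the defaulting is understood, the parallelism can be pinned explicitly; a sketch with an illustrative value (a common rule of thumb is 2-3x the total cores):

```shell
# sketch: pin the default parallelism explicitly instead of relying on
# the scheduler's default (value illustrative)
export SPARK_JAVA_OPTS="-Dspark.default.parallelism=64"
```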
since we are still on scala 2.9.x and trunk migrated to 2.10.x i hope
graphx will get merged into the 0.8.x series at some point, and not just
0.9.x (which is now scala 2.10), since that would make it hard for us to
use in the near future.
best, koert
with long running apps i see this at times:
13/12/21 12:57:59 INFO scheduler.Stage: Stage 1 is now unavailable on
executor 10 (0/66, false)
13/12/21 12:58:19 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(1, node10, 33734, 0) with no recent heart
beats: 50227ms exceeds
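a sketch of the usual mitigation for these heartbeat warnings, using the same property names that appear later in this digest (values illustrative): raise the heartbeat and timeout interval so long gc pauses are not treated as lost block managers.

```shell
# sketch (illustrative values): give executors more slack before the
# master declares their block manager dead after a long gc pause
export SPARK_JAVA_OPTS="-Dspark.storage.blockManagerHeartBeatMs=120000 -Dspark.storage.blockManagerTimeoutIntervalMs=120000"
```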
check out the
> master branch. This patch adds support for accessing hdfs as the user who
> starts the Spark application, not the one who starts the Spark service.
>
>
>
> Thanks
>
> Jerry
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Friday, December
Hey Philip,
how do you get spark to write to hdfs with your user name? When i use spark
it writes to hdfs as the user that runs the spark services... i wish it
read and wrote as me.
On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren wrote:
> When I call rdd.saveAsTextFile("hdfs://...") it uses my use
message as to why the calculation failed (as opposed to: fetch failed more
than 4 times).
On Fri, Nov 29, 2013 at 3:09 PM, Koert Kuipers wrote:
> in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no
> longer accessible.
>
> however the stage contains the rdd, which is
in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no
longer accessible.
however the stage contains the rdd, which is necessary to tie this
StageInfo to an RDD. now all we have is the rddName. is the rddName
guaranteed to be unique, and can it be relied upon to identify RDDs?
the core of hadoop is currently hdfs + mapreduce. the more appropriate
question is if it will become hdfs + spark. so will spark overtake
mapreduce as the dominant computational engine? it's a very serious
candidate for that i think. it can do many things mapreduce cannot do, and
has an awesome api.
in fact co-partitioning was one of the main reason we started using spark.
in map-reduce it's a giant pain to implement
On Sat, Nov 16, 2013 at 3:05 PM, Koert Kuipers wrote:
> we use PartitionBy a lot to keep multiple datasets co-partitioned before
> caching.
> it works well.
>
>
we use PartitionBy a lot to keep multiple datasets co-partitioned before
caching.
it works well.
On Sat, Nov 16, 2013 at 5:10 AM, guojc wrote:
> After looking at the api more carefully, I just found I overlooked the
> partitionBy function on PairRDDFunction. It's the function I need. Sorry
>
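to illustrate why partitionBy keeps datasets co-partitioned: a hash partitioner assigns a key to a partition purely from the key's hashCode and the partition count, never from the dataset itself. a plain-scala sketch of that logic (the helper name is illustrative, not Spark's API):

```scala
// plain-scala sketch of the idea behind a hash partitioner: the partition
// id depends only on the key and the partition count
def partitionOf(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

val numPartitions = 4
val keys = Seq("song1", "song2", "song3", "user1")
// two "datasets" sharing the partitioner place equal keys in equally
// numbered partitions, so joining them needs no shuffle
val placementA = keys.map(k => k -> partitionOf(k, numPartitions))
val placementB = keys.map(k => k -> partitionOf(k, numPartitions))
assert(placementA == placementB)
```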
:
> Hey Koert,
>
> Can you give me steps to reproduce this ?
>
>
> On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers wrote:
>
>> Matei,
>> We have some jobs where even the input for a single key in a groupBy
>> would not fit in the task's memory. We rely on m
an explicit goal of the
> project to support it.
>
> Matei
>
> On Oct 28, 2013, at 5:32 PM, Koert Kuipers wrote:
>
> no problem :) i am actually not familiar with what oscar has said on this.
> can you share or point me to the conversation thread?
>
> it is my opinion bas
e please elaborate on why this would be the case and what
>>> stops Spark, as it is today, to be successfully run on very large datasets?
>>> I'll appreciate it.
>>>
>>> I would think that Spark should be able to pull off Hadoop level
>>> throughput in worst
i would say scalding (cascading + DSL for scala) offers similar
functionality to spark, and a similar syntax.
the main difference between spark and scalding is the target workload:
scalding is for long running jobs on very large data. the data is read from
and written to disk between steps. jobs run from mi
d with spark is on the classpath ahead of the one i
> included with my job. is this the case? how do i put my jars first on the
> spark classpath? this should generally be possible as people can include
> newer versions of jars...
>
>
> On Fri, Oct 18, 2013 at 7:30 PM, K
the one i
included with my job. is this the case? how do i put my jars first on the
spark classpath? this should generally be possible as people can include
newer versions of jars...
On Fri, Oct 18, 2013 at 7:30 PM, Koert Kuipers wrote:
> woops wrong mailing list
>
> -
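a sketch of one way to get application jars ahead of spark's on the worker classpath, assuming the 0.8-era SPARK_CLASSPATH mechanism in conf/spark-env.sh (the path is illustrative):

```shell
# sketch: prepend application jars on each worker so they resolve before
# the versions bundled with spark (path illustrative)
export SPARK_CLASSPATH="/opt/myapp/lib/myjar-0.1-SNAPSHOT.jar:$SPARK_CLASSPATH"
```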
woops wrong mailing list
-- Forwarded message --
From: Koert Kuipers
Date: Fri, Oct 18, 2013 at 7:29 PM
Subject: snappy
To: spark-us...@googlegroups.com
the snappy bundled with spark 0.8 is causing trouble on CentOS 5:
java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5
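a commonly suggested workaround (my assumption, not from this thread): switch the compression codec to the pure-java lzf implementation so the native snappy library never loads on the old glibc:

```shell
# sketch workaround: avoid loading the native snappy library on centos 5
# by falling back to the pure-java lzf codec
export SPARK_JAVA_OPTS="-Dspark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec"
```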
OK it turned out setting
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer in
SPARK_JAVA_OPTS on the workers/slaves caused all this. not sure why. this
used to work fine in previous versions of spark. but when i removed it the
errors went away.
On Fri, Oct 18, 2013 at 2:59 PM, Koert Kuipers
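one way to scope the serializer to a single application, instead of setting it cluster-wide in SPARK_JAVA_OPTS, is to set the system property in the driver before the SparkContext is constructed (0.8-era configuration style; a sketch, not the confirmed fix for the errors above):

```scala
// sketch: select kryo for this application only, in the driver, before
// the SparkContext is created; no cluster-wide SPARK_JAVA_OPTS needed
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// val sc = new SparkContext(...) would be constructed after this point
assert(System.getProperty("spark.serializer").endsWith("KryoSerializer"))
```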
SPARK_USER_CLASSPATH
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=3
-Dspark.akka.timeout=3 -Dspark.storage.blockManagerHeartBeatMs=12
-Dspark.storage.blockManagerTimeoutIntervalMs=120000 -Dspark.akka.retry.wait=3 -Dspark.akka.frameSize=1
-Dspark.akka.logLifecycleEvents=true
-Dspark.serializer=o
i checked out the v0.8.0-incubating tag again, changed the settings to
build against the correct version of hadoop for our cluster, ran sbt-assembly,
built the tarball, installed it on the cluster, restarted spark... same errors
On Fri, Oct 18, 2013 at 12:49 PM, Koert Kuipers wrote:
> at this point i f
at this point i feel like it must be some sort of version mismatch? i am
gonna check the spark build that i deployed on the cluster
On Fri, Oct 18, 2013 at 12:46 PM, Koert Kuipers wrote:
> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion
> Can you post the build.sbt for your program? It needs to include
> hadoop-client for CDH4.3, and that should *not* be listed as provided.
>
> Matei
>
> On Oct 18, 2013, at 8:23 AM, Koert Kuipers wrote:
>
> ok this has nothing to do with hadoop access. even a simple progr
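a sketch of a build.sbt along the lines Matei describes (versions are illustrative guesses for a CDH4.3 cluster, not taken from the thread); note hadoop-client is a normal dependency, deliberately not "provided":

```scala
// sketch build.sbt (illustrative versions); hadoop-client matches the
// cluster's CDH4.3 build and is NOT marked "provided"
name := "Simple Project"

version := "1.0"

scalaVersion := "2.9.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "0.8.0-incubating",
  "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.3.0"
)

resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
```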
$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
On Fri, Oct 18, 2013 at 10:59 AM, Koert Kuipers wrote:
> i created a tiny sbt project as described here:
> apac
ts epoch is " + task.epoch)
i am guessing accessing epoch on the task is throwing the NPE. any ideas?
On Thu, Oct 17, 2013 at 8:12 PM, Koert Kuipers wrote:
> sorry one more related question:
> i compile against a spark build for hadoop 1.0.4, but the actual installed
> version
8:06 PM, Koert Kuipers wrote:
> i have my spark and hadoop related dependencies as "provided" for my spark
> job. this used to work with previous versions. are these now supposed to be
> compile/runtime/default dependencies?
>
>
> On Thu, Oct 17, 2013 at 8:04 PM, Koer
i have my spark and hadoop related dependencies as "provided" for my spark
job. this used to work with previous versions. are these now supposed to be
compile/runtime/default dependencies?
On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers wrote:
> yes i did that and i can see the
> your version of Hadoop. See
> http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
> for
> example.
>
> Matei
>
> On Oct 17, 2013, at 4:38 PM, Koert Kuipers wrote:
>
> i got the job a little further along by also setting this:
>>
>>>
>>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra
>>> wrote:
>>>
>>>> Of course, you mean 0.9.0-SNAPSHOT. There is no Spark 0.9.0, and won't
>>>> be for several months.
>>>>
>>>>
>>>>
>
"Simple Job",
"/usr/local/lib/spark",
List("dist/myjar-0.1-SNAPSHOT.jar"))
val songs = sc.textFile("hdfs://master:8020/user/koert/songs")
println("songs count: " + songs.count)
}
did something change in how i am supposed to launch jobs?
On Thu, O
I'm sorry if this doesn't answer your question directly, but I have
>>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works..
>>>
>>>
>>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>>>
>>>> after upgrading from spark 0.7 t
st now, it works..
>
>
> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>
>> after upgrading from spark 0.7 to spark 0.8 i can no longer access any
>> files on HDFS.
>> i see the error below. any ideas?
>>
>> i am running spark standalone on a clust
after upgrading from spark 0.7 to spark 0.8 i can no longer access any
files on HDFS.
i see the error below. any ideas?
i am running spark standalone on a cluster that also has CDH4.3.0 and
rebuilt spark accordingly. the jars in lib_managed look good to me.
i noticed similar errors in the mailing
hello all,
currently spark runs tasks as the user that runs the spark worker daemon.
at least, this is what i observe with standalone spark.
as a result spark does not respect user permissions on HDFS.
i guess this could be fixed by running the tasks as a different user, or
even just by using ha