cached rdd in memory eviction

2014-02-24 Thread Koert Kuipers
i was under the impression that running jobs could not evict cached rdds from memory as long as they are below spark.storage.memoryFraction. however what i observe seems to indicate the opposite. did anything change? thanks! koert
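For reference, this is roughly how the fraction could be pinned explicitly from the driver in that era (a minimal sketch; the property name is the one mentioned above, while the master URL and input path are purely illustrative):

  // sketch: set the storage fraction before the SparkContext is created, then cache and touch an RDD
  System.setProperty("spark.storage.memoryFraction", "0.6")

  val sc = new org.apache.spark.SparkContext("spark://master:7077", "cache-test")
  val data = sc.textFile("hdfs://master:8020/some/input")
  data.cache()
  println(data.count())   // the first action materializes the cached blocks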

Re: OutOfMemoryError with basic kmeans

2014-02-17 Thread Koert Kuipers
looks like it could be kryo related? i am only guessing here, but you can configure kryo buffers separately... see: spark.kryoserializer.buffer.mb On Mon, Feb 17, 2014 at 7:49 PM, agg wrote: > Hi guys, > > I'm trying to run a basic version of kmeans (not the mllib version), on > 250gb of data
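A hedged sketch of that suggestion (property names as referenced in the thread; the 64 MB value is only an illustration to be tuned against the largest objects being serialized):

  // sketch: enable Kryo and enlarge its per-thread buffer (in MB) before creating the context
  System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  System.setProperty("spark.kryoserializer.buffer.mb", "64")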

Re: Shared access to RDD

2014-02-17 Thread Koert Kuipers
it is possible to run multiple queries using a shared SparkContext (which holds the shared RDD). however this is not easily available in spark-shell i believe. alternatively tachyon can be used to share (serialized) RDDs On Mon, Feb 17, 2014 at 11:41 AM, David Thomas wrote: > Is it possible fo
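A minimal sketch of the shared-SparkContext approach (master URL and path are illustrative): one long-lived context holds the cached RDD, and multiple queries run against it:

  val sc = new org.apache.spark.SparkContext("spark://master:7077", "shared-rdd")
  val shared = sc.textFile("hdfs://master:8020/events").cache()

  // both queries reuse the same in-memory blocks after the first action
  val total  = shared.count()
  val errors = shared.filter(_.contains("ERROR")).count()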

yarn documentation

2014-02-17 Thread Koert Kuipers
in the documentation for running spark on yarn it states: "We do not requesting container resources based on the number of cores. Thus the numbers of cores given via command line arguments cannot be guaranteed." can someone explain this a bit more? is it simply a reflection of the fact that yarn

Re: default parallelism in trunk

2014-02-02 Thread Koert Kuipers
hedulerBackend.scala#L156> >> . >> >> Simplest possibility is that you're setting spark.default.parallelism, >> otherwise there may be a bug introduced somewhere that isn't defaulting >> correctly anymore. >> >> >> On Sat, Feb 1, 2014 at 12:30

default parallelism in trunk

2014-02-01 Thread Koert Kuipers
i just managed to upgrade my 0.9-SNAPSHOT from the last scala 2.9.x version to the latest. everything seems good except that my default parallelism is now set to 2 for jobs instead of some smart number based on the number of cores (i think that is what it used to do). is this change on purpose?
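As a stopgap while the cause is unclear, the value can be pinned explicitly; a sketch (the number 64 is illustrative, e.g. a small multiple of the total core count):

  // sketch: force the default number of partitions instead of relying on the cluster-derived value
  System.setProperty("spark.default.parallelism", "64")

  // or override per operation (pairs is a hypothetical (K, V) RDD):
  // val counts = pairs.reduceByKey(_ + _, 64)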

graphx merge for scala 2.9

2013-12-27 Thread Koert Kuipers
since we are still on scala 2.9.x and trunk migrated to 2.10.x i hope graphx will get merged into the 0.8.x series at some point, and not just 0.9.x (which is now scala 2.10), since that would make it hard for us to use in the near future. best, koert

how to detect a disconnect

2013-12-21 Thread Koert Kuipers
with long running apps i see this at times: 13/12/21 12:57:59 INFO scheduler.Stage: Stage 1 is now unavailable on executor 10 (0/66, false) 13/12/21 12:58:19 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, node10, 33734, 0) with no recent heart beats: 50227ms exceeds

Re: writing to HDFS with a given username

2013-12-13 Thread Koert Kuipers
check out the > master branch. This patch supports accessing hdfs as the user who > starts the Spark application, not the one who starts the Spark service. > > > > Thanks > > Jerry > > *From:* Koert Kuipers [mailto:ko...@tresata.com] > *Sent:* Friday, December

Re: writing to HDFS with a given username

2013-12-12 Thread Koert Kuipers
Hey Philip, how do you get spark to write to hdfs with your user name? When i use spark it writes to hdfs as the user that runs the spark services... i wish it read and wrote as me. On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren wrote: > When I call rdd.saveAsTextFile("hdfs://...") it uses my use
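On clusters without Kerberos, one workaround (an assumption about the setup, not something confirmed in this thread) is to hand HADOOP_USER_NAME to the executors, since Hadoop's simple authentication picks the user up from that variable; a sketch with illustrative paths and jar name:

  // sketch: pass HADOOP_USER_NAME through the SparkContext's executor environment
  val sc = new org.apache.spark.SparkContext(
    "spark://master:7077", "write-as-koert", "/usr/local/lib/spark",
    List("target/myjob.jar"),             // hypothetical application jar
    Map("HADOOP_USER_NAME" -> "koert"))   // the driver's own environment may need the variable too
  sc.textFile("hdfs://master:8020/user/koert/in")
    .saveAsTextFile("hdfs://master:8020/user/koert/out")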

Re: 0.9-SNAPSHOT StageInfo

2013-11-29 Thread Koert Kuipers
message as to why the calculation failed (as opposed to: fetch failed more than 4 times). On Fri, Nov 29, 2013 at 3:09 PM, Koert Kuipers wrote: > in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no > longer accessible. > > however the stage contains the rdd, which is

0.9-SNAPSHOT StageInfo

2013-11-29 Thread Koert Kuipers
in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no longer accessible. however the stage contains the rdd, which is necessary to tie this StageInfo to an RDD. now all we have is the rddName. is the rddName guaranteed to be unique, and can it be relied upon to identify RDDs?
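If the names are assigned explicitly they can at least be made unique by construction; a sketch (assuming setName is available in the build in question, sc is the existing SparkContext, and the paths are illustrative):

  // sketch: give cached RDDs explicit, unique names so a StageInfo's rddName can be tied back to them
  val users  = sc.textFile("hdfs://master:8020/users").setName("users-input").cache()
  val events = sc.textFile("hdfs://master:8020/events").setName("events-input").cache()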

Re: interesting question on quora

2013-11-18 Thread Koert Kuipers
the core of hadoop is currently hdfs + mapreduce. the more appropriate question is if it will become hdfs + spark. so will spark overtake mapreduce as the dominant computational engine? its a very serious candidate for that i think. it can do many things mapreduce cannot do, and has an awesome api.

Re: Does spark RDD has a partitionedByKey

2013-11-16 Thread Koert Kuipers
in fact co-partitioning was one of the main reasons we started using spark. in map-reduce its a giant pain to implement On Sat, Nov 16, 2013 at 3:05 PM, Koert Kuipers wrote: > we use partitionBy a lot to keep multiple datasets co-partitioned before > caching. > it works well. > >

Re: Does spark RDD has a partitionedByKey

2013-11-16 Thread Koert Kuipers
we use partitionBy a lot to keep multiple datasets co-partitioned before caching. it works well. On Sat, Nov 16, 2013 at 5:10 AM, guojc wrote: > After looking at the api more carefully, I just found I overlooked the > partitionBy function on PairRDDFunction. It's the function I need. Sorry >
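A sketch of the pattern described, under the assumption of a pre-1.0 API (sc is the existing SparkContext; the paths, the tab-separated key extraction and the partition count are illustrative):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // brings partitionBy/join into scope on (K, V) RDDs

  // co-partition both key-value datasets with the same partitioner, then cache
  val part  = new HashPartitioner(128)
  val left  = sc.textFile("hdfs://master:8020/a").map(line => (line.split("\t")(0), line)).partitionBy(part).cache()
  val right = sc.textFile("hdfs://master:8020/b").map(line => (line.split("\t")(0), line)).partitionBy(part).cache()

  // later joins/cogroups see matching partitioning and avoid reshuffling both sides
  val joined = left.join(right)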

Re: compare/contrast Spark with Cascading

2013-10-29 Thread Koert Kuipers
: > Hey Koert, > > Can you give me steps to reproduce this ? > > > On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers wrote: > >> Matei, >> We have some jobs where even the input for a single key in a groupBy >> would not fit in the the tasks memory. We rely on m

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
an explicit goal of the > project to support it. > > Matei > > On Oct 28, 2013, at 5:32 PM, Koert Kuipers wrote: > > no problem :) i am actually not familiar with what oscar has said on this. > can you share or point me to the conversation thread? > > it is my opinion bas

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
e please elaborate on why this would be the case and what >>> stops Spark, as it is today, to be successfully run on very large datasets? >>> I'll appreciate it. >>> >>> I would think that Spark should be able to pull off Hadoop level >>> throughput in worst

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
i would say scalding (cascading + DSL for scala) offers similar functionality to spark, and a similar syntax. the main difference between spark and scalding is the target jobs: scalding is for long running jobs on very large data. the data is read from and written to disk between steps. jobs run from mi

Re: snappy

2013-10-18 Thread Koert Kuipers
d with spark is on the classpath ahead of the one i > included with my job. is this the case? how do i put my jars first on the > spark classpath? this should generally be possible as people can include > newer versions of jars... > > > On Fri, Oct 18, 2013 at 7:30 PM, K

Re: snappy

2013-10-18 Thread Koert Kuipers
the one i included with my job. is this the case? how do i put my jars first on the spark classpath? this should generally be possible as people can include newer versions of jars... On Fri, Oct 18, 2013 at 7:30 PM, Koert Kuipers wrote: > woops wrong mailing list > > -

Fwd: snappy

2013-10-18 Thread Koert Kuipers
woops wrong mailing list -- Forwarded message -- From: Koert Kuipers Date: Fri, Oct 18, 2013 at 7:29 PM Subject: snappy To: spark-us...@googlegroups.com the snappy bundled with spark 0.8 is causing trouble on CentOS 5: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
OK it turned out setting -Dspark.serializer=org.apache.spark.serializer.KryoSerializer in SPARK_JAVA_OPTS on the workers/slaves caused all this. not sure why. this used to work fine in previous spark. but when i removed it the errors went away. On Fri, Oct 18, 2013 at 2:59 PM, Koert Kuipers
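An alternative that sidesteps worker-side SPARK_JAVA_OPTS entirely is to set the serializer from the driver before the context is created, so driver and executors agree on it; a sketch (the registrator class name is hypothetical):

  // sketch: configure Kryo in the driver instead of the workers' SPARK_JAVA_OPTS
  System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // System.setProperty("spark.kryo.registrator", "com.example.MyKryoRegistrator")
  val sc = new org.apache.spark.SparkContext("spark://master:7077", "kryo-job")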

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
SPARK_USER_CLASSPATH export SPARK_JAVA_OPTS="-Dspark.worker.timeout=3 -Dspark.akka.timeout=3 -Dspark.storage.blockManagerHeartBeatMs=12 -Dspark.storage.blockManagerTimeoutIntervalMs=120000 -Dspark.akka.retry.wait=3 -Dspark.akka.frameSize=1 -Dspark.akka.logLifecycleEvents=true -Dspark.serializer=o

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
i checked out the v0.8.0-incubating tag again, changed the settings to build against the correct version of hadoop for our cluster, ran sbt-assembly, built the tarball, installed it on the cluster, restarted spark... same errors On Fri, Oct 18, 2013 at 12:49 PM, Koert Kuipers wrote: > at this point i f

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
at this point i feel like it must be some sort of version mismatch? i am gonna check the spark build that i deployed on the cluster On Fri, Oct 18, 2013 at 12:46 PM, Koert Kuipers wrote: > name := "Simple Project" > > version := "1.0" > > scalaVersion

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
> Can you post the build.sbt for your program? It needs to include > hadoop-client for CDH4.3, and that should *not* be listed as provided. > > Matei > > On Oct 18, 2013, at 8:23 AM, Koert Kuipers wrote: > > ok this has nothing to do with hadoop access. even a simple progr
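A hedged sketch of a build.sbt along the lines Matei describes, with spark-core plus a hadoop-client matching the cluster and neither marked "provided" (version strings, the CDH artifact and the resolvers are illustrative and need to match the actual cluster and Spark release):

  name := "Simple Project"

  version := "1.0"

  scalaVersion := "2.9.3"

  libraryDependencies ++= Seq(
    "org.apache.spark"  %% "spark-core"    % "0.8.0-incubating",
    "org.apache.hadoop" %  "hadoop-client" % "2.0.0-mr1-cdh4.3.0"  // pick the exact CDH4.3 artifact the cluster runs
  )

  resolvers ++= Seq(
    "Akka Repository" at "http://repo.akka.io/releases/",
    "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
  )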

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) On Fri, Oct 18, 2013 at 10:59 AM, Koert Kuipers wrote: > i created a tiny sbt project as described here: > apac

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
ts epoch is " + task.epoch) i am guessing accessing epoch on the task is throwing the NPE. any ideas? On Thu, Oct 17, 2013 at 8:12 PM, Koert Kuipers wrote: > sorry one more related question: > i compile against a spark build for hadoop 1.0.4, but the actual installed > version

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
8:06 PM, Koert Kuipers wrote: > i have my spark and hadoop related dependencies as "provided" for my spark > job. this used to work with previous versions. are these now supposed to be > compile/runtime/default dependencies? > > > On Thu, Oct 17, 2013 at 8:04 PM, Koer

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
i have my spark and hadoop related dependencies as "provided" for my spark job. this used to work with previous versions. are these now supposed to be compile/runtime/default dependencies? On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers wrote: > yes i did that and i can see the

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
> your version of Hadoop. See > http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala > for example. > > Matei > > On Oct 17, 2013, at 4:38 PM, Koert Kuipers wrote: > > i got the job a little further along by also setting this:

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
>>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra >>> wrote: >>>> Of course, you mean 0.9.0-SNAPSHOT. There is no Spark 0.9.0, and won't >>>> be for several months.

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
"Simple Job", "/usr/local/lib/spark", List("dist/myjar-0.1-SNAPSHOT.jar")) val songs = sc.textFile("hdfs://master:8020/user/koert/songs") println("songs count: " + songs.count) } did something change in how i am supposed to launch jobs? On Thu, O

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
I'm sorry if this doesn't answer your question directly, but I have >>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works.. >>> >>> >>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote: >>> >>>> after upgrading from spark 0.7 t

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
st now, it works.. > > > On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote: > >> after upgrading from spark 0.7 to spark 0.8 i can no longer access any >> files on HDFS. >> i see the error below. any ideas? >> >> i am running spark standalone on a clust

spark 0.8

2013-10-17 Thread Koert Kuipers
after upgrading from spark 0.7 to spark 0.8 i can no longer access any files on HDFS. i see the error below. any ideas? i am running spark standalone on a cluster that also has CDH4.3.0 and rebuilt spark accordingly. the jars in lib_managed look good to me. i noticed similar errors in the mailing

HDFS user permissions

2013-09-19 Thread Koert Kuipers
hello all, currently spark runs tasks as the user that runs the spark worker daemon. at least, this is what i observe with standalone spark. as a result spark does not respect user permissions on HDFS. i guess this could be fixed by running the tasks as a different user, or even just by using ha