Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
Hi, I'm confused about -Dspark.local.dir and SPARK_WORKER_DIR (--work-dir). What's the difference? I have set -Dspark.local.dir for all my worker nodes, but I'm still seeing directories being created in /tmp when the job is running. I have also tried setting -Dspark.local.dir when I run the …

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
I'm not 100% sure, but I think it goes like this: spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory). SPARK_WORKER_DIR is where the …
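
For context, a hedged sketch of how these two settings are typically wired up in this era of Spark; the paths, master URL, and the env-file approach are illustrative assumptions, not the thread's confirmed setup:

    // Driver side: point scratch space (shuffle/broadcast files) away from /tmp.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("local-dir-example")
      .set("spark.local.dir", "/data/spark-scratch")   // path is illustrative
    val sc = new SparkContext(conf)

    // Worker side (conf/spark-env.sh), one way to pass the same property on to executors,
    // and to move the worker's own work directory (equivalent to --work-dir):
    //   SPARK_JAVA_OPTS="-Dspark.local.dir=/data/spark-scratch"
    //   SPARK_WORKER_DIR=/data/spark-work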

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
> spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory)
Do you mean the worker nodes? I don't think they are Jetty connectors, and the directories are empty: …

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Guillaume Pitel
> spark.local.dir can and should be set both on the executors and on the driver (if the driver broadcasts variables, the files will be stored in this directory) …

parson json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
Hello, is there anything special about calling functions that parse JSON lines from filter()? I have code that looks like this:

    def jsonMatches(line: String): Boolean = {
      // take a line in JSON format
      val jline = parse(line)
      val je = jline \ "event"
      if (je != JNothing && je.values.toString == …
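
For reference, a self-contained sketch of the pattern being asked about, assuming the json4s-style parse and \ operators that the snippet suggests; the "event" field and "Sign Up" value come from later messages in this thread, and the surrounding setup is hypothetical:

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    def jsonMatches(line: String): Boolean = {
      val jline = parse(line)      // parse one JSON line
      val je = jline \ "event"     // pull out the "event" field
      je != JNothing && je.values.toString == "Sign Up"
    }

    // Used from a filter, e.g.: lines.filter(jsonMatches)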

Re: TriangleCount Shortest Path under Spark

2014-03-13 Thread Keith Massey
The triangle count failed for me when I ran it on more than one node. There was this assertion in TriangleCount.scala: // double count should be even (divisible by two) assert((dblCount & 1) == 0). That did not hold true when I ran this on multiple nodes, even when following the …
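
For reference, a minimal sketch of the preconditions the GraphX documentation lists for triangle count (canonical edge orientation, srcId < dstId, and an explicit partitioning), which that assertion tends to surface when they are not met; the edge-list path is illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    val sc = new SparkContext("local", "triangle-count")

    // Load edges in canonical orientation and partition the graph explicitly.
    val graph = GraphLoader
      .edgeListFile(sc, "graphx/data/followers.txt", canonicalOrientation = true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    val triCounts = graph.triangleCount().vertices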

Re: parson json within rdd's filter()

2014-03-13 Thread Paul Brown
It's trying to send … You just need to have the jsonMatches function available on the worker side of the interaction rather than on the driver side, e.g., put it on an object CodeThatIsRemote that gets shipped with the JARs, and then filter(CodeThatIsRemote.jsonMatches) and you should be off to the …

Re: How to monitor the communication process?

2014-03-13 Thread Mayur Rustagi
You can check out Ganglia for network utilization. Mayur Rustagi, http://www.sigmoidanalytics.com, @mayur_rustagi. On Thu, Mar 13, 2014 at 2:04 AM, moxiecui moxie...@gmail.com wrote: Hello everyone: Say I have an application run on …

Re: parson json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
Hmm. The whole thing is packaged in a .jar file, and I call .addJar on the SparkContext. My expectation is that the whole jar, together with that function, is available on every worker automatically. Is that not a valid expectation? Ognen On 3/13/14, 11:09 AM, Paul Brown wrote: It's …

Re: parson json within rdd's filter()

2014-03-13 Thread Paul Brown
Well, the question is how you're referencing it. If you reference it in a static fashion (function on an object, Scala-wise), then that's dereferenced on the worker side. If you reference it in a way that refers to something on the driver side, serializing the block will attempt to serialize the
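
A small contrast sketch of the two referencing styles being described; all names here are hypothetical. A method on an object is resolved statically on the worker, while a method on an instance pulls that instance, and everything it references, into the serialized closure:

    // Referenced statically: only the function itself needs to reach the worker.
    object RemoteFilters {
      def jsonMatches(line: String): Boolean = line.contains("\"event\"")
    }

    // Referenced via an instance: the closure captures `this`, so the whole
    // Pipeline object (and anything it holds) must be serializable.
    class Pipeline(marker: String) {
      def jsonMatches(line: String): Boolean = line.contains(marker)
    }

    // rdd.filter(RemoteFilters.jsonMatches)             // ships cleanly with the JAR
    // rdd.filter(new Pipeline("\"event\"").jsonMatches) // serializes the Pipeline too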

Re: parson json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
I must be really dense! :) Here is the most simplified version of the code; I removed a bunch of stuff and hard-coded the "event" and "Sign Up" values.

    def jsonMatches(line: String): Boolean = {
      val jLine = parse(line)
      // extract the "event" from the line
      val e = jLine \ "event" …

Re: parson json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
I even tried this: def jsonMatches(line: String): Boolean = true. It is still failing with the same error. Ognen On 3/13/14, 11:45 AM, Ognen Duzlevski wrote: I must be really dense! :) Here is the most simplified version of the code, I removed a bunch of stuff and hard-coded the event and …

How to work with ReduceByKey?

2014-03-13 Thread goi cto
Hi, I have an RDD of <S, Tuple2<I, List>> which I want to reduceByKey to get I+I and a List of Lists (add the integers and build a list of the lists). BUT reduceByKey requires that the return value be of the same type as the input, so I can't simply combine the lists.
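
One way to express this (a sketch, not the only idiom): reshape the value first so its type is closed under the merge, then reduceByKey; a combineByKey would achieve the same in one step. The concrete types and sample data below are hypothetical:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions

    val sc = new SparkContext("local", "reduce-example")

    // Hypothetical sample data: (key, (count, items))
    val rdd = sc.parallelize(Seq(
      ("a", (1, List("x"))),
      ("a", (2, List("y", "z"))),
      ("b", (5, List("w")))))

    // Lift each List into a List-of-Lists so input and output types match,
    // then sum the integers and concatenate the lists of lists.
    val combined = rdd
      .mapValues { case (i, xs) => (i, List(xs)) }
      .reduceByKey { case ((i1, l1), (i2, l2)) => (i1 + i2, l1 ::: l2) }

    combined.collect().foreach(println)
    // (a,(3,List(List(x), List(y, z)))), (b,(5,List(List(w))))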

Re: Spark Java example using external Jars

2014-03-13 Thread Adam Novak
Have a look at my project: https://github.com/adamnovak/sequence-graphs/blob/master/importVCF/src/main/scala/importVCF.scala. I use the SBT Native Packager, which dumps my jar and all its dependency jars into one directory. Then I have my code find the jar it's running from, and loop through that
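
A rough sketch of that approach; the directory name and master URL are assumptions. The idea is to enumerate whatever jars the packager laid down next to the application and hand each one to SparkContext.addJar so the workers can fetch them:

    import java.io.File
    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "external-jars-example")

    // Wherever the packager dumped the application and dependency jars.
    val libDir = new File("lib")
    Option(libDir.listFiles()).getOrElse(Array.empty)
      .filter(_.getName.endsWith(".jar"))
      .foreach(jar => sc.addJar(jar.getAbsolutePath))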

Re: SparkContext startup time out

2014-03-13 Thread velvia
By the way, this is the underlying error for me: java.lang.VerifyError: (class: org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker signature: (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;) Wrong return type in function at
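
This particular VerifyError commonly points to two incompatible Netty 3.x jars on the classpath, though that is not confirmed in this thread. If that were the cause, one might exclude the stray copy in sbt roughly like this; the offending dependency's coordinates are purely illustrative:

    // build.sbt sketch
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      // Keep the older Netty 3.x that this (hypothetical) library drags in off the classpath:
      "com.example" %% "some-other-lib" % "1.0" exclude("org.jboss.netty", "netty")
    )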

Re: sample data for pagerank?

2014-03-13 Thread Mo
You can find it here: https://github.com/apache/incubator-spark/tree/master/graphx/data On Thu, Mar 13, 2014 at 10:13 AM, Diana Carroll dcarr...@cloudera.com wrote: I'd like to play around with the Page Rank example included with Spark but I can't find that any sample data to work with is
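
For anyone following along, a minimal sketch that loads an edge list from that data directory and runs PageRank on it; the specific file name and tolerance are assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    val sc = new SparkContext("local", "pagerank-sample")

    // Edge list from the graphx/data directory linked above (file name assumed).
    val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")

    // Run PageRank until the ranks converge to within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach(println)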

Kafka in Yarn

2014-03-13 Thread aecc
Hi, I would like to know the correct way to add Kafka to my project in standalone YARN, given that it's now in a different artifact than the Spark core. I tried adding the dependency to my project, but I get a ClassNotFoundException for my main class. Also, that makes my jar file very big, …
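
A hedged build sketch for that setup: pull the Kafka connector as its own artifact and package an assembly (fat) jar so the classes actually reach the cluster. The version string is illustrative, and sbt-assembly is assumed for the packaging step:

    // build.sbt sketch
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "0.9.0-incubating" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "0.9.0-incubating" % "provided",
      // The Kafka receiver lives in its own artifact and must ship with the app jar:
      "org.apache.spark" %% "spark-streaming-kafka" % "0.9.0-incubating"
    )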

Re: links for the old versions are broken

2014-03-13 Thread Aaron Davidson
Looks like everything from 0.8.0 and before errors similarly (though Spark 0.3 for Scala 2.9 has a malformed link as well). On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat walrusthe...@gmail.com wrote: Sup, Where can I get Spark 0.7.3? It's 404 here: http://spark.apache.org/downloads.html

Reading back a sorted RDD

2014-03-13 Thread Aureliano Buendia
Hi, after sorting an RDD and writing it to Hadoop, would the RDD still be sorted when reading it back? Can sorting be guaranteed after reading back when the RDD was written as 1 partition with rdd.coalesce(1)?
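
An illustrative round-trip of what is being asked; the path and data are hypothetical. With coalesce(1) there is a single part-file, so the on-disk line order matches the sort order; whether a later read is guaranteed to preserve that order is exactly the open question here:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // sortByKey and other pair functions

    val sc = new SparkContext("local", "sorted-roundtrip")
    val rdd = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

    // Sort, squeeze into one partition, and write a single part-file.
    rdd.sortByKey()
       .coalesce(1)
       .map { case (k, v) => s"$k\t$v" }
       .saveAsTextFile("hdfs:///tmp/sorted-out")   // path is illustrative

    // Read it back as plain lines; the ordering guarantee on this read is
    // what the thread is asking about.
    val readBack = sc.textFile("hdfs:///tmp/sorted-out")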

Re: parson json within rdd's filter()

2014-03-13 Thread Ognen Duzlevski
OK, problem solved. Interesting thing: I took the jsonMatches function below and made it a method on a separate object in its own file. Once done that way, it all serializes and works. Ognen On 3/13/14, 11:52 AM, Ognen Duzlevski wrote: I even tried this: def …

Re: Round Robin Partitioner

2014-03-13 Thread Patrick Wendell
In Spark 1.0 we've added better randomization to the scheduling of tasks, so they are distributed more evenly by default. https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e However, having specific policies like that isn't really supported unless you subclass the RDD …

combining operations elegantly

2014-03-13 Thread Koert Kuipers
Not that long ago there was a nice example on here about how to combine multiple operations on a single RDD: basically, if you want to do a count() and something else, how to roll them into a single job. I think Patrick Wendell gave the examples; I can't find them anymore. Patrick, can you …
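
One common single-pass pattern (a sketch, not necessarily the example referred to above) is to fold several results into a single aggregate() call so only one job runs; the data here is hypothetical:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "combine-ops")
    val rdd = sc.parallelize(1 to 1000)

    // Compute the count and the sum in one pass instead of two separate jobs.
    val (count, sum) = rdd.aggregate((0L, 0L))(
      (acc, x) => (acc._1 + 1, acc._2 + x),     // fold an element into a partition's accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2))   // merge per-partition accumulators

    println(s"count=$count sum=$sum")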

Re: Local Standalone Application and shuffle spills

2014-03-13 Thread Aaron Davidson
The AMPLab Spark internals talk you mentioned is actually referring to the RDD persistence levels, where by default we do not persist RDDs to disk (https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence). spark.shuffle.spill refers to a different behavior -- if the …
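
For reference, a sketch of the settings being contrasted, using the property names as they existed around Spark 0.9; the values shown are illustrative, not recommendations. spark.shuffle.spill governs whether shuffle-side aggregation may overflow to disk, independently of any RDD persistence level:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("spill-example")
      // Allow shuffle aggregation to spill to disk when it outgrows memory...
      .set("spark.shuffle.spill", "true")
      // ...once it uses more than this fraction of the heap for aggregation.
      .set("spark.shuffle.memoryFraction", "0.3")

    val sc = new SparkContext(conf)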