Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
If you need quick responses, re-use your spark context between queries and cache rdds in memory On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing
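A minimal sketch of that advice, assuming a hypothetical long-lived service (path and app name are illustrative): keep one SparkContext alive and cache the RDD once, so later queries read from executor memory instead of re-scanning the input.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("long-lived-query-service"))
    // marked for caching; the first action materializes the blocks in executor memory
    val events = sc.textFile("hdfs:///data/events").cache()

    val total  = events.count()                              // first query fills the cache
    val errors = events.filter(_.contains("ERROR")).count()  // later queries hit cached blocks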

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
yes, tachyon is in memory serialized, which is not as fast as cached in memory in spark (not serialized). the difference really depends on your job type. On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote: Thats exciting! Will be looking into that, thanks Andrew. Related
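The distinction in question, sketched with Spark's own storage levels (assuming an existing SparkContext sc; tachyon itself lives outside the JVM, but the serialization cost profile is similar):

    import org.apache.spark.storage.StorageLevel

    val a = sc.parallelize(1 to 1000000)
    val b = sc.parallelize(1 to 1000000)
    a.persist(StorageLevel.MEMORY_ONLY)      // deserialized objects: fastest to read, largest footprint (what .cache() gives you)
    b.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes: smaller, but every read pays deserialization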

Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
i also noticed that jobs (with a new JobGroupId) which i run after this and which use the same RDDs get very confused. i see lots of cancelled stages and retries that go on forever. On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote: i have a running job that i cancel while

Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
at 2:40 PM, Koert Kuipers ko...@tresata.com wrote: SparkContext.cancelJobGroup On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi mayur.rust...@gmail.comwrote: How do you cancel the job. Which API do you use? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
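A sketch of that API, assuming an existing SparkContext sc; the group id and expensiveStep are placeholders:

    def expensiveStep(i: Int): Int = { Thread.sleep(10); i * 2 }  // stand-in for real work

    // tag everything submitted from this thread with a job group id
    sc.setJobGroup("query-42", "interactive query 42")
    val result = sc.parallelize(1 to 100000).map(expensiveStep).collect()

    // from another thread (e.g. a cancel endpoint), kill all jobs in that group
    sc.cancelJobGroup("query-42")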

Re: Sbt Permgen

2014-03-10 Thread Koert Kuipers
/apache/spark/pull/103 On Sun, Mar 9, 2014 at 8:40 PM, Koert Kuipers ko...@tresata.com wrote: edit last line of sbt/sbt, after which i run: sbt/sbt test On Sun, Mar 9, 2014 at 10:24 PM, Sean Owen so...@cloudera.com wrote: How are you specifying these args? On Mar 9, 2014 8:55 PM, Koert

computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hello all, i am observing a strange result. i have a computation that i run on a cached RDD in spark-standalone. it typically takes about 4 seconds. but when other RDDs that are not relevant to the computation at hand are cached in memory (in same spark context), the computation takes 40 seconds

Re: computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
also be good to see what percent of each GC generation is used. The concurrent mark-and-sweep GC -XX:+UseConcMarkSweepGC or the G1 GC in Java 7 (-XX:+UseG1GC) might also avoid these pauses by GCing concurrently with your application threads. Matei On Mar 10, 2014, at 3:18 PM, Koert Kuipers ko
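One hedged way on spark 1.x to pass those GC flags to the executor JVMs (on the 0.9-era releases discussed in this thread SPARK_JAVA_OPTS served the same purpose):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails")
    val sc = new SparkContext(conf)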

combining operations elegantly

2014-03-13 Thread Koert Kuipers
not that long ago there was a nice example on here about how to combine multiple operations on a single RDD. so basically if you want to do a count() and something else, how to roll them into a single job. i think patrick wendell gave the examples. i can't find them anymore. patrick, can you
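This may not be the example being remembered, but one way to roll a count and a sum into a single pass is aggregate, assuming an existing SparkContext sc:

    val nums = sc.parallelize(1 to 100).map(_.toDouble)
    val (count, sum) = nums.aggregate((0L, 0.0))(
      (acc, x) => (acc._1 + 1, acc._2 + x),    // fold each element into the partition accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge the per-partition accumulators
    )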

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Koert Kuipers
i dont know anything about arm clusters but it looks great. what are the specs? the nodes have no local disk at all? On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi chan...@gmail.comwrote: Hi all, We are a small team doing a research on low-power (and low-cost) ARM clusters. We built

Re: trying to understand job cancellation

2014-03-19 Thread Koert Kuipers
on spark 1.0.0 SNAPSHOT this seems to work. at least so far i have seen no issues yet. On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers ko...@tresata.com wrote: its 0.9 snapshot from january running in standalone mode. have these fixed been merged into 0.9? On Thu, Mar 6, 2014 at 12:45 AM

Re: 答复: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Koert Kuipers
i have found that i am unable to build/test spark with sbt and java6, but using java7 works (and it compiles with java target version 1.6 so binaries are usable from java 6) On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.comwrote: Thanks for the reply. It turns out that

Re: Generic types and pair RDDs

2014-04-01 Thread Koert Kuipers
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] =
      rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }

On Tue, Apr 1, 2014 at 4:55 PM, Daniel

RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
any reason why RDDInfo suddenly became private in SPARK-1132? we are using it to show users status of rdds

Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
of the developer API. The particular PR that will change this is: https://github.com/apache/spark/pull/274. Cheers, Andrew On Mon, Apr 7, 2014 at 5:05 PM, Koert Kuipers ko...@tresata.com wrote: any reason why RDDInfo suddenly became private in SPARK-1132? we are using it to show users status

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui

assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
when i start spark-shell i now see ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or directory we do not package a lib_managed with our spark build (never did). maybe the logic in compute-classpath.sh that searches for datanucleus should check for the existence of

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: note

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
in the storage tab. 2) Did you run ./make-distribution.sh after you switched to the current master? Xiangrui On Tue, Apr 8, 2014 at 9:33 AM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote: sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
() The storagelevels you see here are never the ones of my RDDs. and apparently updateRDDInfo never gets called (i had println in there too). On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers ko...@tresata.com wrote: yes i am definitely using latest On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at same time i can see via our own api: storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote: i put some println statements in BlockManagerUI

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
. Andrew On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers ko...@tresata.com wrote: yet at same time i can see via our own api: storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 } On Tue, Apr 8, 2014 at 2:25 PM

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
** _rddInfoMap: Map() On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers ko...@tresata.com wrote: 1) at the end of the callback 2) yes we simply expose sc.getRDDStorageInfo to the user via REST 3) yes exactly. we define the RDDs at startup, all of them are cached. from

Re: ui broken in latest 1.0.0

2014-04-19 Thread Koert Kuipers
. Thanks again for reporting this. I will push out a fix shortly. Andrew On Tue, Apr 8, 2014 at 1:30 PM, Koert Kuipers ko...@tresata.com wrote: our one cached RDD in this run has id 3 *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel

Re: Storage information about an RDD from the API

2014-04-29 Thread Koert Kuipers
SparkContext.getRDDStorageInfo On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hi, Is it possible to know from code about an RDD if it is cached, and more precisely, how many of its partitions are cached in memory and how many are cached on disk? I
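A minimal sketch of reading that information off an existing SparkContext sc (field names as they appear on the 1.x RDDInfo):

    sc.getRDDStorageInfo.foreach { info =>
      println(s"rdd ${info.id} '${info.name}': " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }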

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-04-29 Thread Koert Kuipers
you need to merge reference.conf files and it's no longer an issue. see the build for spark itself: case "reference.conf" => MergeStrategy.concat On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote: Hello folks, I was going to post this question to spark user group as
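A hedged build.sbt sketch of that merge strategy, assuming the sbt-assembly plugin is on the build classpath (key names differ across plugin versions; this is the 0.14.x style):

    assemblyMergeStrategy in assembly := {
      case "reference.conf" => MergeStrategy.concat   // concatenate typesafe-config/akka defaults instead of picking one
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)                                // fall back to the default strategy for everything else
    }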

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Koert Kuipers
not work. I specified it in my email already. But I figured a way around it by excluding akka dependencies Shivani On Tue, Apr 29, 2014 at 12:37 PM, Koert Kuipers ko...@tresata.com wrote: you need to merge reference.conf files and its no longer an issue. see the Build for for spark itself

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Koert Kuipers
Hey Matei, Not sure i understand that. These are 2 separate jobs. So the second job takes advantage of the fact that there is map output left somewhere on disk from the first job, and re-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.comwrote: Hi Diana, Apart

Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Koert Kuipers
yes it seems broken. i got only a few emails in last few days On Fri, May 9, 2014 at 7:24 AM, wxhsdp wxh...@gmail.com wrote: is there something wrong with the mailing list? very few people see my thread -- View this message in context:

Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote: in writing my own RDD i ran into a few issues with respect to stuff being private in spark. in compute i would like to return an iterator that respects task

Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
will do On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote: You got a good point there, those APIs should probably be marked as @DeveloperAPI. Would you mind filing a JIRA for that ( https://issues.apache.org/jira/browse/SPARK)? On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers

little confused about SPARK_JAVA_OPTS alternatives

2014-05-14 Thread Koert Kuipers
i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant for both executors and my driver program. i used to do: SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1 now this is deprecated. the alternatives mentioned are: * some

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Koert Kuipers
(spark.akka.frameSize, 1).setAppName(...).setMaster(...) val sc = new SparkContext(conf) - Patrick On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote: i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant
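Reconstructed, the replacement Patrick describes looks roughly like this (the frame size value, app name, and master URL are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.akka.frameSize", "100")
      .setAppName("my-app")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)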

cant get tests to pass anymore on master master

2014-05-16 Thread Koert Kuipers
i used to be able to get all tests to pass. with java 6 and sbt i get PermGen errors (no matter how high i make the PermGen). so i have given up on that. with java 7 i see 1 error in a bagel test and a few in streaming tests. any ideas? see the error in BagelSuite below. [info] - large number

Re: cant get tests to pass anymore on master master

2014-05-16 Thread Koert Kuipers
...@cloudera.com wrote: Since the error concerns a timeout -- is the machine slowish? What about blowing away everything in your local maven repo, do a clean, etc. to rule out environment issues? I'm on OS X here FWIW. On Thu, May 15, 2014 at 5:24 PM, Koert Kuipers ko...@tresata.com wrote: yeah

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) On Thu, May 15, 2014 at 3:03 PM, Koert Kuipers ko...@tresata.com wrote: when i set spark.files.userClassPathFirst=true, i get java

writing my own RDD

2014-05-16 Thread Koert Kuipers
in writing my own RDD i ran into a few issues with respect to stuff being private in spark. in compute i would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. also the exception i am supposed to

Re: cant get tests to pass anymore on master master

2014-05-16 Thread Koert Kuipers
yeah sure. it is ubuntu 12.04 with jdk1.7.0_40 what else is relevant that i can provide? On Thu, May 15, 2014 at 12:17 PM, Sean Owen so...@cloudera.com wrote: FWIW I see no failures. Maybe you can say more about your environment, etc. On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers ko

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
(JavaSerializer.scala:60) On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.com wrote: after removing all class parameters of class Path from my code, i tried again. different but related error when i set spark.files.userClassPathFirst=true now i dont even use FileInputFormat directly

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
by the child and somehow this means the companion objects are reset or something like that because i get NPEs. On Fri, May 16, 2014 at 3:54 PM, Koert Kuipers ko...@tresata.com wrote: ok i think the issue is visibility: a classloader can see all classes loaded by its parent classloader

Re: File present but file not found exception

2014-05-19 Thread Koert Kuipers
why does it need to be a local file? why not do some filter ops on the hdfs file and save to hdfs, from where you can create an rdd? you can read a small file in on the driver program and use sc.parallelize to turn it into an RDD On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: I found that
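A minimal sketch of that second suggestion, assuming an existing SparkContext sc and an illustrative path:

    import scala.io.Source

    val lines = Source.fromFile("/tmp/small-input.txt").getLines().toSeq  // read on the driver
    val rdd   = sc.parallelize(lines)                                     // distribute as an RDD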

life of an executor

2014-05-19 Thread Koert Kuipers
from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx

Re: life of an executor

2014-05-20 Thread Koert Kuipers
, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx

Re: life of an executor

2014-05-20 Thread Koert Kuipers
: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave

Re: life of an executor

2014-05-20 Thread Koert Kuipers
, Koert Kuipers ko...@tresata.com wrote: just for my clarification: off heap cannot be java objects, correct? so we are always talking about serialized off-heap storage? On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That's one the main motivation in using Tachyon

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example if i have an akka timeout setting that i would like to be applied to every piece of the spark framework (so spark master, spark workers, spark executor sub-processes, spark-shell, etc.). i used to do

Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
it to HDFS and specify its location by exporting its location as SPARK_JAR. Kevin Markey On 06/19/2014 11:22 AM, Koert Kuipers wrote: i am trying to understand how yarn-client mode works. i am not using spark-submit, but instead launching a spark job from within my own application

Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers ko...@tresata.com wrote: db tsai, if in yarn-cluster mode the driver runs inside yarn, how can you do a rdd.collect and bring the results back to your application? On Thu, Jun 19, 2014

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
Koert Kuipers ko...@tresata.com: still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example if i have an akka timeout setting that i would like to be applied to every piece of the spark framework (so spark master, spark workers, spark executor sub

Re: trying to understand yarn-client mode

2014-06-20 Thread Koert Kuipers
, Koert Kuipers ko...@tresata.com wrote: i am trying to understand how yarn-client mode works. i am not using Application application_1403117970283_0014 failed 2 times due to AM Container for appattempt_1403117970283_0014_02 exited with exitCode: -1000 due to: File file:/home/koert/test

spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead of hdfs. what could cause this? in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not local

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
...@cloudera.com wrote: Hi Koert, Could you provide more details? Job arguments, log messages, errors, etc. On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote: i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
ok solved it. as it happened in spark/conf i also had a file called core.site.xml (with some tachyon related stuff in it) so that's why it ignored /etc/hadoop/conf/core-site.xml On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers ko...@tresata.com wrote: i put some logging statements

Re: Running Spark alongside Hadoop

2014-06-20 Thread Koert Kuipers
for development/testing i think its fine to run them side by side as you suggested, using spark standalone. just be realistic about what size data you can load with limited RAM. On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: The ideal way to do that is to use a

Re: Using Spark as web app backend

2014-06-24 Thread Koert Kuipers
run your spark app in client mode together with a spray rest service, that the front end can talk to On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, So far, I run my spark jobs with spark-shell or spark-submit command. I'd like to go further and I wonder

Re: Spark's Hadooop Dependency

2014-06-25 Thread Koert Kuipers
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % versionSpark % "provided" exclude("org.apache.hadoop", "hadoop-client"),
      "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided"
    )

On Wed, Jun 25, 2014 at 11:26 AM, Robert James srobertja...@gmail.com wrote: To add Spark to a SBT

graphx Joining two VertexPartitions with different indexes is slow.

2014-06-25 Thread Koert Kuipers
lately i am seeing a lot of this warning in graphx: org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. i am using Graph.outerJoinVertices to join in data from a regular RDD (that is co-partitioned). i would like this operation to

why is toBreeze private everywhere in mllib?

2014-07-01 Thread Koert Kuipers
its kind of handy to be able to convert stuff to breeze... is there some other way i am supposed to access that functionality?

Re: MLLib : Math on Vector and Matrix

2014-07-02 Thread Koert Kuipers
i did the second option: re-implemented .toBreeze as .breeze using pimp classes On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges thunder.stump...@gmail.com wrote: I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of code working with internals of MLLib. One of the big

Re: taking top k values of rdd

2014-07-05 Thread Koert Kuipers
. On the driver you can just top k the combined top k from each partition (assuming you have (object, count) for each top k list). — Sent from Mailbox https://www.dropbox.com/mailbox On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers ko...@tresata.com wrote: my initial approach to taking top k values
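A sketch of that per-partition approach on illustrative (object, count) pairs, assuming an existing SparkContext sc (rdd.top(k) with an ordering does essentially the same thing):

    val k = 10
    val counts = sc.parallelize(Seq(("a", 5), ("b", 2), ("c", 9), ("d", 1)))
    val topK = counts
      .mapPartitions { iter =>
        iter.toSeq.sortBy { case (_, count) => -count }.take(k).iterator  // local top k per partition
      }
      .collect()                                        // small: at most k elements per partition
      .sortBy { case (_, count) => -count }
      .take(k)                                          // global top k on the driver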

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Koert Kuipers
thanks for replying. why is joining two vertexrdds without caching slow? what is recomputed unnecessarily? i am not sure what is different here from joining 2 regular RDDs (where nobody seems to recommend to cache before joining i think...) On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave ankurd...@gmail.com wrote: When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
you could only do the deep check if the hashcodes are the same and design hashcodes that do not take all elements into account. the alternative seems to be putting cache statements all over graphx, as is currently the case, which is trouble for any long lived application where caching is

tiers of caching

2014-07-07 Thread Koert Kuipers
i noticed that some algorithms such as graphx liberally cache RDDs for efficiency, which makes sense. however it can also leave a long trail of unused yet cached RDDs, that might push other RDDs out of memory. in a long-lived spark context i would like to decide which RDDs stick around. would it

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Koert Kuipers
spark has a setting to put user jars in front of classpath, which should do the trick. however i had no luck with this. see here: https://issues.apache.org/jira/browse/SPARK-1863 On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote: spark-submit includes a spark-assembly

acl for spark ui

2014-07-07 Thread Koert Kuipers
i was testing using the acl for spark ui in secure mode on yarn in client mode. it works great. my spark 1.0.0 configuration has: spark.authenticate = true spark.ui.acls.enable = true spark.ui.view.acls = koert spark.ui.filters =

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Koert Kuipers
do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I using spark 1.0.0 , using Ooyala Job Server, for

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote: Hi, Yes . I am caching the RDD's by calling cache method.. May i

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop

spark ui on yarn

2014-07-11 Thread Koert Kuipers
I just tested a long lived application (that we normally run in standalone mode) on yarn in client mode. it looks to me like cached rdds are missing in the storage tab of the ui. accessing the rdd storage information via the spark context shows rdds as fully cached but they are missing on

Re: spark ui on yarn

2014-07-12 Thread Koert Kuipers
. Best, On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote: I just tested a long lived application (that we normally run in standalone mode) on yarn in client mode. it looks to me like cached rdds are missing in the storage tab of the ui. accessing the rdd storage

Re: spark ui on yarn

2014-07-13 Thread Koert Kuipers
: The UI code is the same in both, but one possibility is that your executors were given less memory on YARN. Can you check that? Or otherwise, how do you know that some RDDs were cached? Matei On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote: hey shuo, so far all stage links

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Koert Kuipers
but be aware that spark-defaults.conf is only used if you use spark-submit On Jul 17, 2014 4:29 PM, Zongheng Yang zonghen...@gmail.com wrote: One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here:

using shapeless in spark to optimize data layout in memory

2014-07-23 Thread Koert Kuipers
hello all, in case anyone is interested, i just wrote a short blog about using shapeless in spark to optimize data layout in memory. blog is here: http://tresata.com/tresata-open-sources-spark-columnar code is here: https://github.com/tresata/spark-columnar

graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
i have graphx queries running inside a service where i collect the results to the driver and do not hold any references to the rdds involved in the queries. my assumption was that with the references gone spark would go and remove the cached rdds from memory (note, i did not cache them, graphx

how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
hey we used to publish spark inhouse by simply overriding the publishTo setting. but now that the sbt build is integrated with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) havent found anything that looks like publishing 2) i feel

Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
and if i want to change the version, it seems i have to change it in all 23 pom files? mhhh. is it mandatory for these sub-project pom files to repeat that version info? useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote

Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
for you by this plugin. Maven requires artifacts to set a version and it can't inherit one. I feel like I understood the reason this is necessary at one point. On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote: and if i want to change the version, it seems i have to change

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers ko...@tresata.com wrote: ah ok thanks. guess i am gonna read up about maven-release-plugin then! On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote

spark-submit symlink

2014-08-05 Thread Koert Kuipers
spark-submit doesn't handle being a symlink currently: $ spark-submit /usr/local/bin/spark-submit: line 44: /usr/local/bin/spark-class: No such file or directory /usr/local/bin/spark-submit: line 44: exec: /usr/local/bin/spark-class: cannot execute: No such file or directory to fix i changed the

mllib style

2014-08-11 Thread Koert Kuipers
i was just looking at ALS (mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala). is there any need for all the variables to be vars and to have all these setters around? it just leads to so much clutter. if you really want them to be vars it is safe in scala to make them public

SchemaRDD

2014-08-27 Thread Koert Kuipers
i feel like SchemaRDD has usage beyond just sql. perhaps it belongs in core?

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Koert Kuipers
matei, it is good to hear that the restriction that keys need to fit in memory no longer applies to combineByKey. however join requiring keys to fit in memory is still a big deal to me. does it apply to both sides of the join, or only one (while the other side is streaming)? On Sat, Aug 30,

SPARK_MASTER_IP

2014-09-13 Thread Koert Kuipers
a grep for SPARK_MASTER_IP shows that sbin/start-master.sh and sbin/start-slaves.sh are the only ones that use it. yet for example in CDH5 the spark-master is started from /etc/init.d/spark-master by running bin/spark-class. does that mean SPARK_MASTER_IP is simply ignored? it looks like that to

Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-15 Thread Koert Kuipers
in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set spark.driver.extraClassPath or SPARK_CLASSPATH. SPARK_CLASSPATH is set in spark-env.sh

Re: SPARK_MASTER_IP

2014-09-15 Thread Koert Kuipers
hey mark, you think that this is on purpose, or is it an omission? thanks, koert On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover m...@apache.org wrote: Hi Koert, I work on Bigtop and CDH packaging and you are right, based on my quick glance, it doesn't seem to be used. Mark From: Koert

Re: Adjacency List representation in Spark

2014-09-18 Thread Koert Kuipers
we build our own adjacency lists as well. the main motivation for us was that graphx has some assumptions about everything fitting in memory (it has .cache statements all over the place). however if my understanding is wrong and graphx can handle graphs that do not fit in memory i would be interested

secondary sort

2014-09-20 Thread Koert Kuipers
now that spark has a sort based shuffle, can we expect a secondary sort soon? there are some use cases where getting a sorted iterator of values per key is helpful.
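A hedged sketch of the composite-key pattern this turned into (repartitionAndSortWithinPartitions landed in a later release): partition on the real key only, but sort by the full (key, value) pair, so each key's values come back in order within its partition.

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._

    class FirstPartPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = key match {
        case (k: String, _) => math.abs(k.hashCode % parts)   // route on the real key only
      }
    }

    val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
    val sorted = pairs
      .map { case (k, v) => ((k, v), ()) }                     // composite key carries the value
      .repartitionAndSortWithinPartitions(new FirstPartPartitioner(2))
      .map { case ((k, v), _) => (k, v) }                      // plain pairs again, values now sorted per key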

Re: Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-21 Thread Koert Kuipers
. On Mon, Sep 15, 2014 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote: in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set

in memory assumption in cogroup?

2014-09-29 Thread Koert Kuipers
apologies for asking yet again about spark memory assumptions, but i cant seem to keep it in my head. if i use PairRDDFunctions.cogroup, it returns for every key 2 iterables. do the contents of these iterables have to fit in memory? or is the data streamed?
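For reference, the shape in question, assuming an existing SparkContext sc:

    import org.apache.spark.SparkContext._

    val a = sc.parallelize(Seq((1, "x"), (1, "y")))
    val b = sc.parallelize(Seq((1, 10), (2, 20)))
    // for every key, one iterable of values from each side
    val grouped: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[Int]))] = a.cogroup(b)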

run scalding on spark

2014-10-01 Thread Koert Kuipers
well, sort of! we make input/output formats (cascading taps, scalding sources) available in spark, and we ported the scalding fields api to spark. so it's for those of us that have a serious investment in cascading/scalding and want to leverage that in spark. blog is here:

Re: run scalding on spark

2014-10-01 Thread Koert Kuipers
thanks On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects . Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko

Re: com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread Koert Kuipers
you ran out of kryo buffer. are you using spark 1.1 (which supports buffer resizing) or spark 1.0 (which has a fixed size buffer)? On Oct 21, 2014 5:30 PM, nitinkak001 nitinkak...@gmail.com wrote: I am running a simple rdd filter command. What does it mean? Here is the full stack trace(and code
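A hedged sketch of the 1.1-era settings for that buffer (these keys were renamed in later releases, so check the docs for your version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.kryoserializer.buffer.mb", "8")        // initial kryo buffer per core
      .set("spark.kryoserializer.buffer.max.mb", "512")  // ceiling the buffer may grow to (1.1+)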

Re: combine rdds?

2014-10-27 Thread Koert Kuipers
this requires evaluation of the rdd to do the count.

    val x: RDD[X] = ...
    val y: RDD[X] = ...
    x.cache
    val z = if (x.count > thres) x.union(y) else x

On Oct 27, 2014 7:51 PM, Josh J joshjd...@gmail.com wrote: Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is

Re: Is Spark the right tool?

2014-10-28 Thread Koert Kuipers
spark can definitely very quickly answer queries like give me all transactions with property x. and you can put a http query server in front of it and run queries concurrently. but spark does not support inserts, updates, or fast random access lookups. this is because RDDs are immutable and

Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessarily the case (for example RDD.take does not). instead i would use mapPartitionsWithContext, in which case you can write a function of the form. f: (TaskContext, Iterator[T]) => Iterator[U] now
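A hedged sketch of that pattern on the 1.x developer API (the temp-file resource is illustrative; addOnCompleteCallback was later superseded by task-completion listeners), assuming an existing SparkContext sc:

    import java.io.{BufferedWriter, File, FileWriter}
    import org.apache.spark.TaskContext

    val lines  = sc.parallelize(Seq("a", "b", "c"), 2)
    val result = lines.mapPartitionsWithContext { (context: TaskContext, iter: Iterator[String]) =>
      // per-partition resource that must be closed even if iter is never fully consumed
      val writer = new BufferedWriter(new FileWriter(File.createTempFile("partition-", ".log")))
      context.addOnCompleteCallback(() => writer.close())
      iter.map { line => writer.write(line); writer.newLine(); line.toUpperCase }
    }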

Re: Java api overhead?

2014-10-29 Thread Koert Kuipers
since spark holds data structures on heap (and by default tries to work with all data in memory) and it's written in Scala, seeing lots of scala Tuple2 is not unexpected. how do these numbers relate to your data size? On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I wanted
