java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker
Hi all,

*Spark version: bae07e3 [behind 1] fix different versions of commons-lang dependency and apache/spark#746 addendum*

I have six worker nodes, and four of them hit this NoClassDefFoundError when I use start-slaves.sh on my driver node. However, running ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://MASTER_IP:PORT directly on the worker nodes works fine. I compile the /spark directory on the driver node and distribute it to all the worker nodes; the paths on all nodes are identical.

Here is the log from one of the four failing worker nodes:

Spark Command: java -cp ::/home/wanghao/spark/conf:/home/wanghao/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://192.168.1.12:7077 --webui-port 8081

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.worker.Worker
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.spark.deploy.worker.Worker. Program will exit.

Here is spark-env.sh:

export SPARK_WORKER_MEMORY=1g
export SPARK_MASTER_IP=192.168.1.12
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=2

hosts file:

127.0.0.1 localhost
192.168.1.12 sing12

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.1.11 sing11
192.168.1.59 sing59
###
# failed machines
###
192.168.1.122 host122
192.168.1.123 host123
192.168.1.124 host124
192.168.1.125 host125

Regards,
Wang Hao (王灏)
CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com
Re: breeze DGEMM slow in spark
Hi Xiangrui,

I checked the stderr of the worker node, and yes, it fails to load the implementation from com.github.fommil.netlib.NativeSystemBLAS... What do you mean by "include breeze-natives or netlib:all"?

Things I've already done:
1. add the breeze and breeze-natives dependencies in the sbt build file
2. download all breeze jars to the slaves
3. add the jars to the classpath on the slaves
4. ln -s libopenblas_nehalemp-r0.2.9.rc2.so libblas.so.3 and add it to LD_LIBRARY_PATH on the slaves

Thank you for your help

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
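[Editor's note: a minimal sketch of what step 1 above might look like in an sbt build file. The 0.7 version matches the breeze_2.10-0.7.jar that shows up later in this thread but is otherwise an assumption; use whatever version your project actually depends on.]

// build.sbt (sketch); versions are assumptions, not taken from the original post
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.7",
  // breeze-natives pulls in the netlib-java native wrappers (the "include
  // breeze-natives" suggestion referenced above)
  "org.scalanlp" %% "breeze-natives" % "0.7"
  // the "netlib:all" alternative mentioned in the thread would be roughly
  // "com.github.fommil.netlib" % "all" % "1.1.2" (a pom-only artifact)
)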
Re: breeze DGEMM slow in spark
Hi Xiangrui,

You said:

"It doesn't work if you put the netlib-native jar inside an assembly jar. Try to mark it provided in the dependencies, and use --jars to include them with spark-submit. -Xiangrui"

I'm not using an assembly jar that contains everything; I also marked the breeze dependencies as provided, and manually downloaded the jars and added them to the slave classpath. But it still doesn't work :(

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5979.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
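[Editor's note: besides passing --jars to spark-submit, the same jars can be listed programmatically on the SparkConf, which Spark then ships to the executors. This is a hedged sketch of that alternative, not the fix confirmed in this thread; the jar paths are placeholders.]

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ship the netlib native jars to executors via SparkConf.setJars
// (an alternative to spark-submit --jars). Paths below are illustrative only.
val conf = new SparkConf()
  .setAppName("BreezeNativeBlas")
  .setJars(Seq(
    "/path/to/netlib-native_system-linux-x86_64-1.1-natives.jar",
    "/path/to/native_system-java-1.1.jar",
    "/path/to/jniloader-1.1.jar"
  ))
val sc = new SparkContext(conf)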
Re: Configuring Spark for reduceByKey on on massive data sets
Hi,

Try using *reduceByKeyLocally*.

Regards
Lukas Nalezenec

On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Make sure you set up enough reduce partitions so you don't overload them. Another thing that may help is checking whether you've run out of local disk space on the machines, and turning on spark.shuffle.consolidateFiles to produce fewer files. Finally, there's been a recent fix in both branch 0.9 and master that reduces the amount of memory used when there are small files (due to extra memory that was being taken by mmap()): https://issues.apache.org/jira/browse/SPARK-1145. You can find this in either the 1.0 release candidates on the dev list, or branch-0.9 in git.

Matei

On May 17, 2014, at 5:45 PM, Madhu ma...@madhu.com wrote:

Daniel,
How many partitions do you have? Are they more or less uniformly distributed? We have similar data volume currently running well on Hadoop MapReduce with roughly 30 nodes. I was planning to test it with Spark. I'm very interested in your findings.
- Madhu
https://www.linkedin.com/in/msiddalingaiah

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
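[Editor's note: a hedged sketch making the two suggestions above concrete. The input path, key parsing, and the partition count of 400 are illustrative, not taken from the thread.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions in the 1.0-era API

val conf = new SparkConf()
  .setAppName("ReduceByKeyExample")
  // produce fewer shuffle files on disk, per the suggestion above
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///path/to/input")
  .map(line => (line.split("\t")(0), 1))

// reduceByKey with an explicitly larger number of reduce partitions so no
// single reducer is overloaded
val counts = pairs.reduceByKey(_ + _, 400)

// reduceByKeyLocally returns a plain Map on the driver; only appropriate when
// the set of distinct keys fits comfortably in driver memory
val localCounts: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)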
Re: File list read into single RDD
Doesn't using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes, I'm unclear about how much to assume about using the Hadoop file system APIs and conventions. Concretely, if I pass a pattern in with an HTTPS file system, will the pattern work?

How does Spark implement its storage system? This seems to be an abstraction level beyond what is available in HDFS. In order to preserve that flexibility, what APIs should I be using? It would be easy to say "HDFS only" and use the HDFS APIs, but that would seem to limit things, especially where you would like to read from one cluster and write to another. This is not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I'm OK, but if I need to examine the structure of the file system, I'm unclear how I should do it without sacrificing Spark's flexibility.

On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote:

Hi,
You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
e.g.: sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:

Not that I know of. We were discussing it on another thread and it came up. I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
But that's not obvious.
Nick

On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:

Perfect. BTW, just so I know where to look next time, was that in some docs?

On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load. For example:
sc.textFile('/path/to/file1,/path/to/file2')
So once you have your list of files, concatenate their paths like that and pass the single string to textFile().
Nick

On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:

sc.textFile(URI) supports reading multiple files in parallel, but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I'd like to read them into a single RDD in parallel. I understand these could go into separate RDDs and then a union RDD could be created. Is there a way to create a single RDD from a URI list?
Re: File list read into single RDD
Spark's sc.textFile() method (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456) delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths() call (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L546). There is no alternate storage system; Spark just delegates to Hadoop for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you can use Spark on data in S3 using s3:/// just the same as you would with HDFS. See Apache's documentation on S3 (https://wiki.apache.org/hadoop/AmazonS3) for more details.

As far as interacting with a FileSystem (HDFS or other) to list files, delete files, navigate paths, etc. from your driver program, you should be able to just instantiate a FileSystem object and use the normal Hadoop APIs from there. The Apache getting-started docs on reading/writing from Hadoop DFS (https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample) should work the same for non-HDFS examples too.

I do think we could use a little recipe in our documentation to make interacting with HDFS a bit more straightforward. Pat, if you get something that covers your case that you don't mind sharing, we can format it for inclusion in future Spark docs.

Cheers!
Andrew

On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote:

Doesn't using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes, I'm unclear about how much to assume about using the Hadoop file system APIs and conventions. Concretely, if I pass a pattern in with an HTTPS file system, will the pattern work?

How does Spark implement its storage system? This seems to be an abstraction level beyond what is available in HDFS. In order to preserve that flexibility, what APIs should I be using? It would be easy to say "HDFS only" and use the HDFS APIs, but that would seem to limit things, especially where you would like to read from one cluster and write to another. This is not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I'm OK, but if I need to examine the structure of the file system, I'm unclear how I should do it without sacrificing Spark's flexibility.

On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote:

Hi,
You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
e.g.: sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:

Not that I know of. We were discussing it on another thread and it came up. I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
But that's not obvious.
Nick

On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:

Perfect. BTW, just so I know where to look next time, was that in some docs?

On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load. For example:
sc.textFile('/path/to/file1,/path/to/file2')
So once you have your list of files, concatenate their paths like that and pass the single string to textFile().
Nick

On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:

sc.textFile(URI) supports reading multiple files in parallel, but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I'd like to read them into a single RDD in parallel. I understand these could go into separate RDDs and then a union RDD could be created. Is there a way to create a single RDD from a URI list?
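[Editor's note: a hedged sketch of the "instantiate a FileSystem object and use the normal Hadoop APIs" suggestion above; the namenode URI, base path, and *.tsv pattern are made-up examples, and `sc` is an existing SparkContext.]

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List matching files with the Hadoop FileSystem API (the same globStatus()
// call linked earlier in the thread), then pass the result to sc.textFile()
// as a single comma-delimited string.
val fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), new Configuration())
val matched = fs.globStatus(new Path("/data/input/*.tsv")).map(_.getPath.toString)

val rdd = sc.textFile(matched.mkString(","))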
Re: Passing runtime config to workers?
I see - I didn't realize that scope would work like that. Are you saying that any variable that is in scope of the lambda passed to map will be automagically propagated to all workers? What if it's not explicitly referenced in the map, only used by it? E.g.:

def main:
  settings.setSettings
  rdd.map(x => F.f(x))

object F {
  def f(...)...
  val settings: ...
}

F.f accesses F.settings, like a singleton. The master sets F.settings before using F.f in a map. Will all workers have the same F.settings as seen by F.f?

On 5/16/14, DB Tsai dbt...@stanford.edu wrote:

Since the env variables in the driver will not be passed into workers, the easiest way is to refer to the variables directly in the workers from the driver. For example,

val variableYouWantToUse = System.getenv("something defined in env")
rdd.map( you can access `variableYouWantToUse` here )

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, May 16, 2014 at 1:59 PM, Robert James srobertja...@gmail.com wrote:

What is a good way to pass config variables to workers? I've tried setting them in environment variables via spark-env.sh, but, as far as I can tell, the environment variables set there don't appear in workers' environments. If I want to be able to configure all workers, what's a good way to do it? For example, I want to tell all workers: USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.
IllegalAccessError when writing to HBase?
Hi all,

I tried to write data to HBase in a Spark 1.0 rc8 application; the application is terminated due to a java.lang.IllegalAccessError. The HBase shell works fine, and the same application works with a standalone HBase deployment.

java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
    at org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
    at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
    at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
    at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
    at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
    at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Can anyone give some hint to the issue?

Best,
--
Nan Zhu
Re: IllegalAccessError when writing to HBase?
I tried hbase-0.96.2/0.98.1/0.98.2; the HDFS version is 2.3.

--
Nan Zhu

On Sunday, May 18, 2014 at 4:18 PM, Nan Zhu wrote:

Hi all,

I tried to write data to HBase in a Spark 1.0 rc8 application; the application is terminated due to a java.lang.IllegalAccessError. The HBase shell works fine, and the same application works with a standalone HBase deployment.

java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
    at org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
    at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
    at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
    at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
    at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
    at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Can anyone give some hint to the issue?

Best,
--
Nan Zhu
Re: unsubscribe
Hi Shangyu (and everyone else looking to unsubscribe!), If you'd like to get off this mailing list, please send an email to user *-unsubscribe*@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here: https://www.apache.org/foundation/mailinglists.html And the Spark User list specifically can be found here: http://mail-archives.apache.org/mod_mbox/spark-user/ Thanks! Andrew On Sun, May 18, 2014 at 12:39 PM, Shangyu Luo lsy...@gmail.com wrote: Thanks!
Re: Text file and shuffle
I think the shuffle is unavoidable given that the input partitions (probably Hadoop input splits in your case) are not arranged in the way a cogroup job needs. But maybe you can try:

1) co-partition your data for the cogroup:

val par = new HashPartitioner(128)
val big = sc.textFile(..).map(...).partitionBy(par)
val small = sc.textFile(...).map(...).partitionBy(par)
...

See the discussion in https://groups.google.com/forum/#!topic/spark-users/gUyCSoFo5RI

2) since you have 25 GB of memory on each node, you can use a broadcast variable in Spark to distribute the smaller dataset to each node and do the cogroup with it (see the sketch after this message).

2014-05-18 4:41 GMT+02:00 Puneet Lakhina puneet.lakh...@gmail.com:

Hi,

I'm new to Spark and I wanted to understand a few things conceptually so that I can optimize my Spark job.

I have a large text file (~14 GB, 200k lines). This file is available on each worker node of my Spark cluster. The job I run calls sc.textFile(...).flatMap(...). The function that I pass into flatMap splits up each line from the file into a key and value.

Now I have another text file which is smaller in size (~1.5 GB) but has a lot more lines, because it has more than one value per key, spread across multiple lines. I call the same textFile and flatMap functions on the other file and then call groupByKey to have all values for a key available as a list. Having done this, I then cogroup these 2 RDDs. I have the following questions:

1. Is this sequence of steps the best way to achieve what I want, i.e. a join across the 2 data sets?
2. I have 8 nodes (25 GB memory each). The large file's flatMap spawns about 400-odd tasks whereas the small file's flatMap only spawns about 30-odd tasks. The large file's flatMap takes about 2-3 mins, and during this time it seems to do about 3 GB of shuffle write. I want to understand if this shuffle write is something I can avoid. From what I have read, the shuffle write is a disk write. Is that correct? Also, is the reason for the shuffle write the fact that the partitioner for flatMap ends up having to redistribute the data across the cluster?

Please let me know if I haven't provided enough information. I'm new to Spark, so if you see anything fundamental that I don't understand, please feel free to just point me to a link that provides some detailed information.

Thanks,
Puneet

--
*JU Han*
Data Engineer @ Botify.com
+33 061960
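[Editor's note: a hedged sketch of the two suggestions above for readers who want something runnable; the file paths, the tab-separated parsing, and the 128-partition count are assumptions, not taken from the thread.]

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD functions in the 1.0-era API

// 1) Co-partition both datasets with the same partitioner before cogroup, so the
//    cogroup itself does not have to reshuffle either side.
val par = new HashPartitioner(128)
val big = sc.textFile("hdfs:///data/big.tsv")
  .map { line => val f = line.split("\t"); (f(0), f(1)) }
  .partitionBy(par)
val small = sc.textFile("hdfs:///data/small.tsv")
  .map { line => val f = line.split("\t"); (f(0), f(1)) }
  .partitionBy(par)
val grouped = big.cogroup(small)

// 2) Alternatively, broadcast the smaller dataset and do a map-side join, which
//    avoids shuffling the large dataset at all. Only viable if the small dataset
//    fits in memory on each node.
val smallMap = sc.broadcast(small.groupByKey().collectAsMap())
val joined = big.mapPartitions { iter =>
  iter.flatMap { case (k, v) =>
    smallMap.value.get(k).map(vs => (k, (v, vs)))
  }
}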
First sample with Spark Streaming and three Time's?
Hi,

I'm quite new to Spark Streaming and developed the following application to pass 4 strings, process them and shut down:

val conf = new SparkConf(false) // skip loading external settings
  .setMaster("local[1]") // run locally with one thread
  .setAppName("Spark Streaming with Scala") // name in Spark web UI
val ssc = new StreamingContext(conf, Seconds(5))
val stream: ReceiverInputDStream[String] = ssc.receiverStream(
  new Receiver[String](StorageLevel.MEMORY_ONLY_SER_2) {
    def onStart() {
      println("[ACTIVATOR] onStart called")
      store("one")
      store("two")
      store("three")
      store("four")
      stop("No more data...receiver stopped")
    }
    def onStop() {
      println("[ACTIVATOR] onStop called")
    }
  }
)
stream.count().map(cnt => "Received " + cnt + " events.").print()
ssc.start()
// ssc.awaitTermination(1000)
val stopSparkContext, stopGracefully = true
ssc.stop(stopSparkContext, stopGracefully)

I'm running it with `xsbt 'runMain StreamingApp'`, with xsbt and Spark built from the latest sources. What I noticed is that the app generates:

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(1, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 1 (take at DStream.scala:593) finished in 0.245 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at DStream.scala:593, took 4.829798 s
---
Time: 140044517 ms
---
14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(3, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 3 (take at DStream.scala:593) finished in 0.022 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at DStream.scala:593, took 0.194738 s
---
Time: 1400445175000 ms
---
14/05/18 22:33:00 INFO DAGScheduler: Completed ResultTask(5, 0)
14/05/18 22:33:00 INFO DAGScheduler: Stage 5 (take at DStream.scala:593) finished in 0.014 s
14/05/18 22:33:00 INFO SparkContext: Job finished: take at DStream.scala:593, took 0.319387 s
---
Time: 140044518 ms
---

Why are there three jobs finished? I would expect one, since after `store` the app immediately calls `stop`. Can I have a single job that would process these 4 `store`s?

Jacek
--
Jacek Laskowski | http://blog.japila.pl
"Never discourage anyone who continually makes progress, no matter how slow." Plato
Re: breeze DGEMM slow in spark
ok Spark Executor Command: java -cp :/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/root/.sbt/boot/scala-2.10.3/lib/scala-reflect.jar:/root/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/root/.ivy2/cache/com.github.fommil.netlib/core/jars/core-1.1.2.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1-javadoc.jar:/root/.ivy2/cache/net.sf.opencsv/opencsv/jars/opencsv-2.3.jar:/root/.ivy2/cache/com.github.rwl/jtransforms/jars/jtransforms-2.4.0.jar:/root/.ivy2/cache/junit/junit/jars/junit-4.8.2.jar:/root/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.2.jar:/root/.ivy2/cache/org.spire-math/spire_2.10/jars/spire_2.10-0.7.1.jar:/root/.ivy2/cache/org.spire-math/spire-macros_2.10/jars/spire-macros_2.10-0.7.1.jar:/root/.ivy2/cache/com.typesafe/scalalogging-slf4j_2.10/jars/scalalogging-slf4j_2.10-1.0.1.jar:/root/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.2.jar:/root/.ivy2/cache/org.scalanlp/breeze-natives_2.10/jars/breeze-natives_2.10-0.7.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/jars/netlib-native_ref-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_ref-java/jars/native_ref-java-1.1.jar:/root/.ivy2/cache/com.github.fommil/jniloader/jars/jniloader-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/jars/netlib-native_ref-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-i686/jars/netlib-native_ref-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-x86_64/jars/netlib-native_ref-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-i686/jars/netlib-native_ref-win-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-armhf/jars/netlib-native_ref-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-osx-x86_64/jars/netlib-native_system-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_system-java/jars/native_system-java-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-x86_64/jars/netlib-native_system-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-i686/jars/netlib-native_system-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-armhf/jars/netlib-native_system-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-x86_64/jars/netlib-native_system-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-i686/jars/netlib-native_system-win-i686-1.1-natives.jar ::/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar -Xms4096M -Xmx4096M -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5994.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Passing runtime config to workers?
When you reference any variable outside the executor's scope, Spark will automatically serialize it in the driver and send it to the executors, which implies those variables have to implement Serializable. For the example you mention, Spark will serialize object F, and if it's not serializable, it will raise an exception.

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, May 18, 2014 at 12:58 PM, Robert James srobertja...@gmail.com wrote:

I see - I didn't realize that scope would work like that. Are you saying that any variable that is in scope of the lambda passed to map will be automagically propagated to all workers? What if it's not explicitly referenced in the map, only used by it? E.g.:

def main:
  settings.setSettings
  rdd.map(x => F.f(x))

object F {
  def f(...)...
  val settings: ...
}

F.f accesses F.settings, like a singleton. The master sets F.settings before using F.f in a map. Will all workers have the same F.settings as seen by F.f?

On 5/16/14, DB Tsai dbt...@stanford.edu wrote:

Since the env variables in the driver will not be passed into workers, the easiest way is to refer to the variables directly in the workers from the driver. For example,

val variableYouWantToUse = System.getenv("something defined in env")
rdd.map( you can access `variableYouWantToUse` here )

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, May 16, 2014 at 1:59 PM, Robert James srobertja...@gmail.com wrote:

What is a good way to pass config variables to workers? I've tried setting them in environment variables via spark-env.sh, but, as far as I can tell, the environment variables set there don't appear in workers' environments. If I want to be able to configure all workers, what's a good way to do it? For example, I want to tell all workers: USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.
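[Editor's note: a hedged sketch of the pattern discussed in this thread — read the setting on the driver and capture it in a local val, so the closure serialization described above ships the value to the executors. The environment variable name, the algorithm functions, and the paths are all hypothetical.]

import org.apache.spark.{SparkConf, SparkContext}

object ConfigExample {
  // Hypothetical algorithm implementations, just to make the sketch self-contained
  def runAlgoA(x: String): String = x.toUpperCase
  def runAlgoB(x: String): String = x.toLowerCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ConfigExample"))

    // Read the config on the driver and capture it in a local val; the map closure
    // references it, so Spark serializes the value and sends it with the tasks.
    val algo = sys.env.getOrElse("USE_ALGO", "A")

    sc.textFile("hdfs:///path/to/input")
      .map(x => if (algo == "A") runAlgoA(x) else runAlgoB(x))
      .saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}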
sync master with slaves with bittorrent?
I am launching a rather large cluster on EC2. It seems like the launch is taking forever on:

Setting up spark
RSYNC'ing /root/spark to slaves...
...

It seems that BitTorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not-very-powerful slaves. Just a thought ...

cheers
Daniel
Re: sync master with slaves with bittorrent?
Out of curiosity, do you have a library in mind that would make it easy to set up a BitTorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard EC2 launch scripts to reduce the chance of something breaking.

On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:

I am launching a rather large cluster on EC2. It seems like the launch is taking forever on:

Setting up spark
RSYNC'ing /root/spark to slaves...
...

It seems that BitTorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not-very-powerful slaves. Just a thought ...

cheers
Daniel
Re: sync master with slaves with bittorrent?
I am not an expert in this space either. I thought the initial rsync during launch is really just a straight copy that does not need the tree diff, so it seemed like having the slaves do the copying among each other would be better than having the master copy to everyone directly. That made me think of BitTorrent, though there may well be other systems that do this.

From the launches I did today, it seems to be taking around 1 minute per slave to launch a cluster, which can be a problem for clusters with 10s or 100s of slaves, particularly since on EC2 that time has to be paid for.

On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson ilike...@gmail.com wrote:

Out of curiosity, do you have a library in mind that would make it easy to set up a BitTorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard EC2 launch scripts to reduce the chance of something breaking.

On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler dmah...@gmail.com wrote:

I am launching a rather large cluster on EC2. It seems like the launch is taking forever on:

Setting up spark
RSYNC'ing /root/spark to slaves...
...

It seems that BitTorrent might be a faster way to replicate the sizeable spark directory to the slaves, particularly if there are a lot of not-very-powerful slaves. Just a thought ...

cheers
Daniel