Re: how to save RDD partitions in different folders?
Can you provide an example?
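In case a sketch helps (this is not from the original thread; it assumes a pair RDD whose key names the target folder, and uses Hadoop's MultipleTextOutputFormat to route records):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each (key, value) record into a sub-folder named after the key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name  // e.g. "2014-04-07/part-00000"
}

// rdd is assumed to be an RDD[(String, String)] keyed by the target folder name.
rdd.saveAsHadoopFile("/output/base", classOf[String], classOf[String], classOf[KeyBasedOutput])

Each key then ends up under its own sub-folder of /output/base.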
hang on sorting operation
I am seeing a small standalone cluster (master, slave) hang when I reach a certain memory threshold, but I cannot work out how to configure memory to avoid this. I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this allocated, but it does not help. The reduce is by key, to get the counts by key:

rdd = sc.parallelize(self.phrases)
# do a distributed count using reduceByKey
counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
# reverse the (key, count) pairs into (count, key) and then sort in descending order
sorted = counts.map(lambda (key, count): (count, key)).sortByKey(False)

Below is the log to the point of hanging:

14/04/06 19:39:15 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (PairwiseRDD[2] at reduceByKey)
14/04/06 19:39:15 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/04/06 19:39:15 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@localhost:64370/user/Executor#-2031138316] with ID 0
14/04/06 19:39:15 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: localhost (PROCESS_LOCAL)
14/04/06 19:39:15 INFO TaskSetManager: Serialized task 1.0:0 as 10417848 bytes in 18 ms
14/04/06 19:39:15 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor 0: localhost (PROCESS_LOCAL)
14/04/06 19:39:15 INFO TaskSetManager: Serialized task 1.0:1 as 10571697 bytes in 13 ms
14/04/06 19:39:15 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager localhost:64375 with 294.9 MB RAM
14/04/06 19:39:16 INFO TaskSetManager: Finished TID 0 in 1397 ms on localhost (progress: 0/2)
14/04/06 19:39:16 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)

When I interrupt the running program, here is the stack trace, which appears stuck after the reduce, in the sorting by count in descending order:

sorted = counts.map(lambda (key, count): (count, key)).sortByKey(False)
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 361, in sortByKey
    rddSize = self.count()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 542, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 533, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 499, in reduce
    vals = self.mapPartitions(func).collect()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 463, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 535, in __call__
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 363, in send_command
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 472, in send_command
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
KeyboardInterrupt

Is there a reason why the sorting gets stuck? I can easily remove the problem by reducing the size of the RDD below the threshold of about 800,000 items before the reduce runs. It would help to see where resources like memory are depleted, but this does not show up in the console. Many thanks, Stuart Zakon
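One thing worth checking (a guess, not a confirmed diagnosis): SPARK_DAEMON_MEMORY only sizes the standalone master and worker daemon processes, not the executors that actually run the tasks, so the JVMs doing the reduce may still be at their default heap. A minimal sketch of raising executor memory instead, shown in Scala with a hypothetical master URL (the same spark.executor.memory property applies when constructing a PySpark context in 0.9):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")    // hypothetical master URL
  .setAppName("PhraseCounts")
  .set("spark.executor.memory", "2g")  // heap for each executor, not the daemons
val sc = new SparkContext(conf)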
Recommended way to develop spark application with both java and python
Dear all, We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are familiar with Python, but some features are developed in Java. I am looking for a way to integrate Java and Python on Spark. I notice that the initialization of PySpark does not include a field to distribute jar files to the slaves. After exploring the source code and doing some hacking, I could control the Java SparkContext object through Py4J, but the jar files are not delivered to the slaves. Moreover, it seems that Spark launches the executor process through the Spark home in PySpark but through spark.executor.uri in Scala. Is there a recommended way to develop a Spark application with both Java/Scala and Python? Should I suggest that my team unify on one language? Thanks!
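For reference, the Scala-side SparkContext constructor does take a list of jars to ship to the slaves (a sketch with hypothetical paths; PySpark in 0.8.1 exposes no equivalent field, which matches the observation above):

import org.apache.spark.SparkContext

// The fourth argument lists jars that Spark serves to every executor.
val sc = new SparkContext(
  "mesos://master:5050",            // hypothetical Mesos master
  "JavaPlusPython",
  "/opt/spark",                     // hypothetical Spark home
  Seq("/opt/jobs/my-java-lib.jar")  // shipped to the slaves automatically
)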
Re: Spark Disk Usage
Hi, Any thoughts on this? Thanks. -Suren

On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/sequences, like groupBy and shuffling. Basically, one part of the question is: when is disk used internally? And is calling persist() on the RDD returned by such transformations what lets it know to use disk in those situations? Trying to understand if persist() is applied during the transformation or after it. Thank you.

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io
Re: Sample Project for using Shark API in Spark programs
Hi Shark, Should I assume that Shark users should not use the Shark APIs, since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry

On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote: Hello everyone, I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode in a cluster of 6 nodes for testing purposes. I would like to use the Shark API in Spark programs. So far I could only find the following:

$ ./bin/shark-shell
scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
scala> println(youngUsers.count)
...
scala> val featureMatrix = youngUsers.map(extractFeatures(_))
scala> kmeans(featureMatrix)

Is there a more complete code sample to start a program using the Shark API in Spark? Thanks! Jerry
Require some clarity on partitioning
Hi, I was going through Matei's Advanced Spark presentation at https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions. The slides for this video are at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf

The PageRank example introduces partitioning in the following way:

val ranks = // RDD of (url, rank) pairs
val links = sc.textFile(...).map(...).partitionBy(new HashPartitioner(8))

However later on, it is said that:

1) Any shuffle operation on two RDDs will take on the partitioner of one of them, if one is set.
Question 1: Could we have applied partitionBy on the ranks RDD and had the same result/performance?

2) Otherwise, by default use HashPartitioner.
Question 2: If partitionBy applies HashPartitioner in this example, could we simply not have any partitioner and have relied on the default HashPartitioner to achieve the same result/performance?

I had another question unrelated to this presentation.
Question 3: If my processing is something like this:

rdd3 = rdd1.join(rdd2)
rdd4 = rdd3.map((k, (v1, v2)) => (v1, k))
rdd6 = rdd4.join(rdd5)
rdd6.saveAsTextFile("out.txt")

would I benefit from partitioning? Unlike the PageRank example, I do not have to join/shuffle the same RDD or key more than once. Regards, Sanjay
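A sketch of the distinction in question (my reading, with hypothetical data): pre-partitioning pays off when the partitioned RDD survives across more than one shuffle, because Spark can then join against it repeatedly without re-shuffling it:

import org.apache.spark.HashPartitioner

// Partition links once and keep it in memory; the partitioner is remembered.
val links = sc.textFile("links.txt")           // hypothetical input
  .map(line => (line.split("\\s+")(0), line))  // (url, link data)
  .partitionBy(new HashPartitioner(8))
  .cache()

var ranks = links.mapValues(_ => 1.0)
// On each iteration only ranks is shuffled to links' partitions;
// links itself stays put because its partitioner is already set.
for (i <- 1 to 10) {
  ranks = links.join(ranks).mapValues { case (_, rank) => rank * 0.85 + 0.15 }
}

In the one-shot pipeline of question 3, each RDD is shuffled at most once, so pre-partitioning mostly moves the shuffle earlier rather than removing one.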
Null Pointer Exception in Spark Application with Yarn Client Mode
Hi All, I wanted to get Spark on YARN up and running. I did:

SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly

Then I ran:

SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.1-incubating.jar MASTER=yarn-client ./spark-shell

I have SPARK_HOME and YARN_CONF_DIR/HADOOP_CONF_DIR set. Still I get the following error. Any clues?

ERROR:
...
Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Initializing interpreter...
Creating SparkContext...
java.lang.NullPointerException
	at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:115)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
	at org.apache.spark.deploy.yarn.Client$.populateHadoopClasspath(Client.scala:489)
	at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:510)
	at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:327)
	at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:90)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:71)
	at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:862)
	at <init>(<console>:10)
	at <init>(<console>:22)
	at <init>(<console>:24)
	at .<init>(<console>:28)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $export(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
	at org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
	at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
	at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
	at java.lang.Thread.run(Thread.java:744)
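One guess (not confirmed for this build): the NPE is thrown while iterating the array of Hadoop classpath entries in populateHadoopClasspath, which suggests that yarn.application.classpath resolves to null in the yarn-site.xml that YARN_CONF_DIR points to. If so, defining it explicitly may help; the entries below are illustrative and should match your Hadoop 2.3.0 layout:

<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
  </value>
</property>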
Re: How to create a RPM package
For issue #2 I was concerned that the build packaging had to be internal. So I am using the already-packaged make-distribution.sh (modified to use a Maven build) to create a tarball, which I then package using an RPM spec file.

Hi Rahul, so the issue for downstream operating system distributions is that they basically need to be able to build everything from source. So while you're building Spark itself locally, a package that appears in Fedora or Debian will go a bit further and also build all of the library dependencies, compilers, etc., from sources (possibly with some exceptions for bootstrapping compilers that can't be built without themselves) rather than pulling down binaries from a public Maven or Ivy repository.

Although on a side note, it would be interesting to learn how the list of files does not need to be maintained in the spec file (the spec file that Christophe attached was using an explicit list).

If you take a look at the spec that is in Fedora (http://pkgs.fedoraproject.org/cgit/spark.git/tree/spark.spec, starting at line 257 in the current revision), you'll see the following %files section:

%files -f .mfiles
%dir %{_javadir}/%{name}
%doc LICENSE README.md

%files javadoc
%{_javadocdir}/%{name}
%doc LICENSE

As you can see, this list is pretty generic (and obviously doesn't include all the files in the package). The list of JAR and POM files is provided in a file called .mfiles (which is automatically generated by macros) and we just have to specify the directories that the package owns (which aren't picked up by the macros at this time), the license, and the README. best, wb
PySpark SocketConnect Issue in Cluster
Hi, We have a situation where a PySpark script works fine as a local process (local URL) on the master and the worker nodes, which would indicate that all Python dependencies are set up properly on each machine. But when we try to run the script at the cluster level (using the master's URL), it fails partway through the flow on a groupBy with a SocketConnect error, and Python crashes. This is on EC2 using the AMI. This doesn't seem to be an issue of the master not seeing the workers, since they show up in the web UI. Also, we can see the job running on the cluster until it reaches the groupBy transform step, which is when we get the SocketConnect error. Any ideas? -Suren

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io
Re: Spark Disk Usage
It might help if I clarify my questions. :-)

1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transform's processing is complete? In the case of things like groupBy, is the Seq backed by disk as it is being created? We're trying to get a sense of how the processing is handled behind the scenes with respect to disk.

2. When else is disk used internally?

Any pointers are appreciated. -Suren

On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, Any thoughts on this? Thanks. -Suren

On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/sequences, like groupBy and shuffling. Basically, one part of the question is: when is disk used internally? And is calling persist() on the RDD returned by such transformations what lets it know to use disk in those situations? Trying to understand if persist() is applied during the transformation or after it. Thank you.

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io
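For what it's worth, a sketch of how persist behaves (my understanding, not an authoritative answer): persist() only marks the RDD with a storage level; nothing is written until an action first materializes it, at which point each computed partition is stored, spilling to disk if the chosen level allows it:

import org.apache.spark.storage.StorageLevel

// pairs is a hypothetical key/value RDD.
val grouped = pairs.groupByKey()
  .persist(StorageLevel.MEMORY_AND_DISK)  // mark only; no I/O happens here
grouped.count()                           // first action: partitions are computed and stored
grouped.mapValues(_.size).take(5)         // reuses the stored partitions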
Re: reduceByKeyAndWindow Java
Hi TD, could you explain this code part to me?

.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
);

Thanks

On 4/4/14, 22:56, Tathagata Das wrote: I haven't really compiled the code, but it looks good to me. Why? Is there any problem you are facing? TD

On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi guys, I would like to know if this part of the code is right to use in a window.

JavaPairDStream<String, Integer> wordCounts = words.map(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
);

Thanks

--
Privacy notice: http://www.unibs.it/node/8155
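For context, a reading of what the two functions do (my gloss, not TD's reply): with an inverse function, reduceByKeyAndWindow computes each window incrementally — the first function folds in counts from batches entering the window, while the second subtracts counts from batches leaving it, here over a 5-minute window sliding every second. The equivalent call in Scala, with a hypothetical wordPairs DStream:

import org.apache.spark.streaming.Seconds

val windowedCounts = wordPairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // fold in counts from batches entering the window
  (a: Int, b: Int) => a - b,  // subtract counts from batches leaving the window
  Seconds(300),               // window length: 5 minutes
  Seconds(1)                  // slide interval: 1 second
)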
AWS Spark-ec2 script with different user
Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the spark-ec2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the spark-ec2 script allows you to specify which user to log in with, but even when I change this, the script fails for various reasons. And the output SEEMS to show that the script still assumes the specified user's home directory is '/root'. Am I using this script wrong? Has anyone had success with this 'ec2-user' user? Any ideas? Please and thank you, Marco.
SparkContext.addFile() and FileNotFoundException
Hi, I'm trying to use SparkContext.addFile() to propagate a file to worker nodes, in a standalone cluster (2 nodes: 1 master, 1 worker connected to the master). I don't have HDFS or any distributed file system; just playing with basic stuff. Here's the code in my driver (actually spark-shell running on the master node). In the current directory I have the file spam.data. The following commands are taken from the book http://www.packtpub.com/fast-data-processing-with-spark/book, page 44.

scala> sc.addFile("spam.data")
14/04/07 14:03:48 INFO Utils: Copying /home/thierry/dev/spark-samples/packt-book/LoadSaveExample/spam.data to /tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data
14/04/07 14:03:49 INFO SparkContext: Added file spam.data at http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972

scala> import org.apache.spark.SparkFiles
import org.apache.spark.SparkFiles

scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))
14/04/07 14:05:00 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=311387750
14/04/07 14:05:00 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 296.8 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:13

Now trigger some action to make the worker work:

scala> inFile.count()

In the stderr.log of the app on the worker:

14/04/07 14:05:33 INFO Executor: Fetching http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972
14/04/07 14:05:33 INFO Utils: Fetching http://192.168.1.51:59008/files/spam.data to /tmp/fetchFileTemp435286457200696761.tmp

So apparently the file was successfully downloaded from the driver to the worker. The jar of the application is also successfully downloaded. But a bit later, in the same stderr.log:

14/04/07 14:05:34 INFO HttpBroadcast: Reading broadcast variable 0 took 0.352334273 s
14/04/07 14:05:34 INFO HadoopRDD: Input split: file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data:0+349170
14/04/07 14:05:34 ERROR Executor: Exception in task ID 0
java.io.FileNotFoundException: File file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
	at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
	at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
	at org.apache.spark.scheduler.Task.run(Task.scala:53)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

It looks like the file is looked for in /tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data, which is the temp location on the master node where the driver is running, while it was downloaded on the worker node to /tmp/fetchFileTemp435286457200696761.tmp. I see Hadoop-related classes in the stack trace. Does that mean HDFS is used? If so, is it because I'm using the precompiled spark-0.9.0-incubating-bin-hadoop2? I couldn't find any answer, neither in the Spark user list, nor by googling it, nor in the Spark guides (sorry for what is probably a very basic question).
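A possible reading of what went wrong (a sketch, not a confirmed answer): SparkFiles.get("spam.data") was evaluated on the driver, so the absolute path baked into textFile() is the driver's temp directory, which the worker cannot find locally. The Hadoop classes appear only because sc.textFile always goes through Hadoop's FileSystem API, even for local file: paths; no HDFS is involved. One way to make the addFile/SparkFiles pair work is to resolve the path inside a task, where it points at the worker's own copy:

import org.apache.spark.SparkFiles
import scala.io.Source

sc.addFile("spam.data")
// A single-partition RDD whose one task reads the worker-local copy:
val lines = sc.parallelize(Seq(0), 1).flatMap { _ =>
  Source.fromFile(SparkFiles.get("spam.data")).getLines().toList
}
lines.count()

For reading a data set in parallel, the usual approach is instead to place the file at the same path on every node (or on a shared filesystem) and call sc.textFile on that path directly.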
Re: Status of MLI?
That work is under submission at an academic conference and will be made available if/when the paper is published. In terms of algorithms for hyperparameter tuning, we consider Grid Search, Random Search, a couple of older derivative-free optimization methods, and a few newer methods - TPE (aka HyperOpt, from James Bergstra), SMAC (from Frank Hutter's group), and Spearmint (Jasper Snoek's method). The short answer is that in our hands Random Search works surprisingly well for the low-dimensional problems we looked at, but TPE and SMAC perform slightly better. I've got a private branch with TPE (as well as random and grid search) integrated with MLI, but the code is research quality right now and not extremely general. We're actively working on bringing these things up to snuff for a proper open source release.

On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou yi.zou.li...@gmail.com wrote: Hi, Evan, Just noticed this thread. Do you mind sharing more details regarding algorithms targeted at hyperparameter tuning/model selection, or a link to the dev git repo for that work? thanks, yi

On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Targeting 0.9.0 should work out of the box (just a change to the build.sbt) - I'll push some changes I've been sitting on to the public repo in the next couple of days.

On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote: Thanks for the update Evan! In terms of using MLI, I see that the Github code is linked to Spark 0.8; will it not work with 0.9 (which is what I have set up) or higher versions?

On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] [hidden email] wrote: Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan

On Tue, Apr 1, 2014 at 8:05 PM, Krakna H [hidden email] wrote: Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively developed? I'm familiar with mllib and have been looking at its documentation. Thanks!

On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] [hidden email] wrote: mllib has been part of the Spark distribution (under the mllib directory); also check http://spark.apache.org/docs/latest/mllib-guide.html and, regarding JIRA, because of the recent migration to the Apache JIRA, I think all mllib-related issues should be under the Spark umbrella: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel -- Nan Zhu

On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote: What is the current development status of MLI/MLBase? I see that the github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no activity in the last 30 days (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). Is the plan to add a lot of this into mllib itself without needing a separate API? Thanks!
Re: Sample Project for using Shark API in Spark programs
I might be wrong here, but I don't believe it's discouraged. Maybe part of the reason there aren't a lot of examples is that sql2rdd returns an RDD (a TableRDD, that is: https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala). I haven't done anything too complicated yet, but my impression is that almost any Spark example of manipulating RDDs should apply from that line onwards. Are you asking for samples of what to do with the RDD once you get it, or of how to get a SharkContext from a standalone program? Also, my reading of a recent email on this list is that the Shark API will be largely superseded by a more general Spark SQL API in 1.0 (http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html). So if you're just starting out and you don't have short-term needs, that might be a better place to start...

On Mon, Apr 7, 2014 at 9:14 AM, Jerry Lam chiling...@gmail.com wrote: Hi Shark, Should I assume that Shark users should not use the Shark APIs, since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry

On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote: Hello everyone, I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode in a cluster of 6 nodes for testing purposes. I would like to use the Shark API in Spark programs. So far I could only find the following:

$ ./bin/shark-shell
scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
scala> println(youngUsers.count)
...
scala> val featureMatrix = youngUsers.map(extractFeatures(_))
scala> kmeans(featureMatrix)

Is there a more complete code sample to start a program using the Shark API in Spark? Thanks! Jerry
Re: AWS Spark-ec2 script with different user
Hi Shivaram, OK so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but, this worked last Friday, but today (Monday) it no longer works. :D Why? ... It seems that NOW, when you launch a 'paravirtual' AMI, the root user's 'authorized_keys' file is always overwritten. This means the workaround doesn't work anymore! I would LOVE for someone to verify this. Just to point out, I am trying to make this work with a paravirtual instance and not an HVM instance. Please and thanks, Marco.

On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: Right now the spark-ec2 scripts assume that you have root access, and a lot of internal scripts assume the user's home directory is hard-coded as /root. However all the Spark AMIs we build should have root ssh access -- do you find this not to be the case? You can also enable root ssh access in a vanilla AMI by editing /etc/ssh/sshd_config and setting PermitRootLogin to yes. Thanks Shivaram

On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the spark-ec2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the spark-ec2 script allows you to specify which user to log in with, but even when I change this, the script fails for various reasons. And the output SEEMS to show that the script still assumes the specified user's home directory is '/root'. Am I using this script wrong? Has anyone had success with this 'ec2-user' user? Any ideas? Please and thank you, Marco.
Driver Out of Memory
Hi Guys, I would like to understand why the Driver's RAM goes down. Does the processing occur only in the workers? Thanks

# Start Tests

computer1 (Worker/Source Stream)
23:57:18 up 12:03, 1 user, load average: 0.03, 0.31, 0.44
             total  used  free  shared  buffers  cached
Mem:          3945  1084  2860       0       44     827
-/+ buffers/cache:   212  3732
Swap:            0     0     0

computer8 (Driver/Master)
23:57:18 up 11:53, 5 users, load average: 0.43, 1.19, 1.31
             total  used  free  shared  buffers  cached
Mem:          5897  4430  1466       0      384    2662
-/+ buffers/cache:  1382  4514
Swap:            0     0     0

computer10 (Worker/Source Stream)
23:57:18 up 12:02, 1 user, load average: 0.55, 1.34, 0.98
             total  used  free  shared  buffers  cached
Mem:          5897   564  5332       0       18     358
-/+ buffers/cache:   187  5709
Swap:            0     0     0

computer11 (Worker/Source Stream)
23:57:18 up 12:02, 1 user, load average: 0.07, 0.19, 0.29
             total  used  free  shared  buffers  cached
Mem:          3945   603  3342       0       54     355
-/+ buffers/cache:   193  3751
Swap:            0     0     0

After 2 minutes:

computer1
00:06:41 up 12:12, 1 user, load average: 3.11, 1.32, 0.73
             total  used  free  shared  buffers  cached
Mem:          3945  2950   994       0       46    1095
-/+ buffers/cache:  1808  2136
Swap:            0     0     0

computer8 (Driver/Master)
00:06:41 up 12:02, 5 users, load average: 1.16, 0.71, 0.96
             total  used  free  shared  buffers  cached
Mem:          5897  5191   705       0      385    2792
-/+ buffers/cache:  2014  3882
Swap:            0     0     0

computer10
00:06:41 up 12:11, 1 user, load average: 2.02, 1.07, 0.89
             total  used  free  shared  buffers  cached
Mem:          5897  2567  3329       0       21     647
-/+ buffers/cache:  1898  3998
Swap:            0     0     0

computer11
00:06:42 up 12:12, 1 user, load average: 3.96, 1.83, 0.88
             total  used  free  shared  buffers  cached
Mem:          3945  3542   402       0       57    1099
-/+ buffers/cache:  2385  1559
Swap:            0     0     0

--
Privacy notice: http://www.unibs.it/node/8155
Re: AWS Spark-ec2 script with different user
Hmm -- That is strange. Can you paste the command you are using to launch the instances? The typical workflow is to use the spark-ec2 wrapper script, following the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html. Shivaram

On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini silvio.costant...@granatads.com wrote: Hi Shivaram, OK so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but, this worked last Friday, but today (Monday) it no longer works. :D Why? ... It seems that NOW, when you launch a 'paravirtual' AMI, the root user's 'authorized_keys' file is always overwritten. This means the workaround doesn't work anymore! I would LOVE for someone to verify this. Just to point out, I am trying to make this work with a paravirtual instance and not an HVM instance. Please and thanks, Marco.

On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: Right now the spark-ec2 scripts assume that you have root access, and a lot of internal scripts assume the user's home directory is hard-coded as /root. However all the Spark AMIs we build should have root ssh access -- do you find this not to be the case? You can also enable root ssh access in a vanilla AMI by editing /etc/ssh/sshd_config and setting PermitRootLogin to yes. Thanks Shivaram

On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the spark-ec2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the spark-ec2 script allows you to specify which user to log in with, but even when I change this, the script fails for various reasons. And the output SEEMS to show that the script still assumes the specified user's home directory is '/root'. Am I using this script wrong? Has anyone had success with this 'ec2-user' user? Any ideas? Please and thank you, Marco.
CheckpointRDD has different number of partitions than original RDD
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating version 0.9.0 without any Hadoop at all, and need some help. I run into the following error with the StatefulNetworkWordCount example (and similarly in my prototype app, when I use the updateStateByKey operation). I get this when running against my small cluster, but not (so far) against local[2].

61904 [spark-akka.actor.default-dispatcher-2] ERROR org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1396905956000 ms.0
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take at DStream.scala:586(0) has different number of partitions than original RDD MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
	at org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
	at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
	at org.apache.spark.rdd.RDD.take(RDD.scala:844)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:744)

Please let me know what other information would be helpful; I didn't find any question submission guidelines. Thanks, Paul
Re: Creating a SparkR standalone job
Thanks Shivaram! Will give it a try and let you know. Regards, Pawan Venugopal On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: You can create standalone jobs in SparkR as just R files that are run using the sparkR script. These commands will be sent to a Spark cluster and the examples on the SparkR repository ( https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests) are in fact standalone jobs. However I don't think that will completely solve your use case of using Streaming + R. We don't yet have a way to call R functions from Spark's Java or Scala API. So right now one thing you can try is to save data from SparkStreaming to HDFS and then run a SparkR job which reads in the file. Regarding the other idea of calling R from Scala -- it might be possible to do that in your code if the classpath etc. is setup correctly. I haven't tried it out though, but do let us know if you get it to work. Thanks Shivaram On Mon, Apr 7, 2014 at 2:21 PM, pawan kumar pkv...@gmail.com wrote: Hi, Is it possible to create a standalone job in scala using sparkR? If possible can you provide me with the information of the setup process. (Like the dependencies in SBT and where to include the JAR files) This is my use-case: 1. I have a Spark Streaming standalone Job running in local machine which streams twitter data. 2. I have an R script which performs Sentiment Analysis. I am looking for an optimal way where I could combine these two operations into a single job and run using SBT Run command. I came across this document which talks about embedding R into scala ( http://dahl.byu.edu/software/jvmr/dahl-payne-uppalapati-2013.pdf) but was not sure if that would work well within the spark context. Thanks, Pawan Venugopal
job offering
Hi, I am looking for users of Spark to join my teams here at Amazon. If you are reading this, you probably qualify. I am looking for developers of ANY level, but with an interest in Spark. My teams are leveraging Spark to solve real business scenarios. If you are interested, just shoot me a note and I'll tell you more about the opportunities here in Seattle. Best, Severan Rault Director, Software Development Amazon
Re: CheckpointRDD has different number of partitions than original RDD
A few things that would be helpful:

1. Environment settings - you can find them on the Environment tab in the Spark application UI.

2. Are you setting the HDFS configuration correctly in your Spark program? For example, can you write an HDFS file from a Spark program (say, spark-shell) to your HDFS installation and read it back into Spark (i.e., create an RDD)? You can test this by writing an RDD as a text file from the shell, and then trying to read it back from another shell.

3. If that works, then let's try explicitly checkpointing an RDD. To do this you can take any RDD and do the following:

myRDD.checkpoint()
myRDD.count()

If there is some issue, then this should reproduce the above error.

TD

On Mon, Apr 7, 2014 at 3:48 PM, Paul Mogren pmog...@commercehub.com wrote: Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating version 0.9.0 without any Hadoop at all, and need some help. I run into the following error with the StatefulNetworkWordCount example (and similarly in my prototype app, when I use the updateStateByKey operation). I get this when running against my small cluster, but not (so far) against local[2].

61904 [spark-akka.actor.default-dispatcher-2] ERROR org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1396905956000 ms.0
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take at DStream.scala:586(0) has different number of partitions than original RDD MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
	at org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
	at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
	at org.apache.spark.rdd.RDD.take(RDD.scala:844)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:744)

Please let me know what other information would be helpful; I didn't find any question submission guidelines. Thanks, Paul
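A sketch of step 3 with the checkpoint directory made explicit (the paths are hypothetical; the directory must be on a filesystem visible to both the driver and the executors, which is why a purely node-local path fails on a real cluster):

sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")  // shared location
val myRDD = sc.textFile("hdfs://namenode:8020/tmp/input.txt")
myRDD.checkpoint()  // marks the RDD; written out by the next job
myRDD.count()       // materializes the RDD and its checkpoint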
RDDInfo visibility SPARK-1132
any reason why RDDInfo suddenly became private in SPARK-1132? we are using it to show users status of rdds
Re: RDDInfo visibility SPARK-1132
ok yeah we are using StageInfo and TaskInfo too... On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote: Hi Koert, Other users have expressed interest for us to expose similar classes too (i.e. StageInfo, TaskInfo). In the newest release, they will be available as part of the developer API. The particular PR that will change this is: https://github.com/apache/spark/pull/274. Cheers, Andrew On Mon, Apr 7, 2014 at 5:05 PM, Koert Kuipers ko...@tresata.com wrote: any reason why RDDInfo suddenly became private in SPARK-1132? we are using it to show users status of rdds
RE: CheckpointRDD has different number of partitions than original RDD
1: I will paste the full content of the environment page of the example application running against the cluster at the end of this message.

2 and 3: Following #2 I was able to see that the count was incorrectly 0 when running against the cluster, and following #3 I was able to get the message:

org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4] at count at <console>:15(0) has different number of partitions than original RDD MappedRDD[3] at textFile at <console>:12(2)

I think I understand - state checkpoints and other file-exchange operations in a Spark cluster require a distributed/shared filesystem, even with just a single-host cluster and the driver/shell on a second host. Is that correct?

Thank you, Paul

Environment page of the NetworkWordCumulativeCountUpdateStateByKey application UI:

Runtime Information
Java Home: /usr/lib/jvm/jdk1.8.0/jre
Java Version: 1.8.0 (Oracle Corporation)
Scala Home:
Scala Version: version 2.10.3

Spark Properties
spark.app.name: NetworkWordCumulativeCountUpdateStateByKey
spark.cleaner.ttl: 3600
spark.deploy.recoveryMode: ZOOKEEPER
spark.deploy.zookeeper.url: pubsub01:2181
spark.driver.host: 10.10.41.67
spark.driver.port: 37360
spark.fileserver.uri: http://10.10.41.67:40368
spark.home: /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
spark.httpBroadcast.uri: http://10.10.41.67:45440
spark.jars: /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
spark.master: spark://10.10.41.19:7077

System Properties
awt.toolkit: sun.awt.X11.XToolkit
file.encoding: ANSI_X3.4-1968
file.encoding.pkg: sun.io
file.separator: /
java.awt.graphicsenv: sun.awt.X11GraphicsEnvironment
java.awt.printerjob: sun.print.PSPrinterJob
java.class.version: 52.0
java.endorsed.dirs: /usr/lib/jvm/jdk1.8.0/jre/lib/endorsed
java.ext.dirs: /usr/lib/jvm/jdk1.8.0/jre/lib/ext:/usr/java/packages/lib/ext
java.home: /usr/lib/jvm/jdk1.8.0/jre
java.io.tmpdir: /tmp
java.library.path:
java.net.preferIPv4Stack: true
java.runtime.name: Java(TM) SE Runtime Environment
java.runtime.version: 1.8.0-b132
java.specification.name: Java Platform API Specification
java.specification.vendor: Oracle Corporation
java.specification.version: 1.8
java.vendor: Oracle Corporation
java.vendor.url: http://java.oracle.com/
java.vendor.url.bug: http://bugreport.sun.com/bugreport/
java.version: 1.8.0
java.vm.info: mixed mode
java.vm.name: Java HotSpot(TM) 64-Bit Server VM
java.vm.specification.name: Java Virtual Machine Specification
java.vm.specification.vendor: Oracle Corporation
java.vm.specification.version: 1.8
java.vm.vendor: Oracle Corporation
java.vm.version: 25.0-b70
line.separator:
log4j.configuration: conf/log4j.properties
os.arch: amd64
os.name: Linux
os.version: 3.5.0-23-generic
path.separator: :
sun.arch.data.model: 64
sun.boot.class.path: /usr/lib/jvm/jdk1.8.0/jre/lib/resources.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/rt.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/sunrsasign.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jsse.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jce.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/charsets.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jfr.jar:/usr/lib/jvm/jdk1.8.0/jre/classes
sun.boot.library.path: /usr/lib/jvm/jdk1.8.0/jre/lib/amd64
sun.cpu.endian: little
sun.cpu.isalist:
sun.io.serialization.extendedDebugInfo: true
sun.io.unicode.encoding: UnicodeLittle
sun.java.command: org.apache.spark.streaming.examples.StatefulNetworkWordCount spark://10.10.41.19:7077 localhost
sun.java.launcher: SUN_STANDARD
sun.jnu.encoding: ANSI_X3.4-1968
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
sun.nio.ch.bugLevel:
sun.os.patch.level: unknown
user.country: US
user.dir: /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
user.home: /home/pmogren
user.language: en
user.name: pmogren
user.timezone: America/New_York

Classpath Entries
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar (System Classpath)
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/conf (System Classpath)
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar (System Classpath)
http://10.10.41.67:40368/jars/spark-examples_2.10-assembly-0.9.0-incubating.jar (Added By User)

From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Monday, April 07, 2014 7:54 PM
To: user@spark.apache.org
Subject: Re: CheckpointRDD has different number of partitions than original RDD

Few things that would be helpful. 1. Environment settings - you can find them on the environment tab in the Spark application UI 2.
Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...
Great!!! When I built it on another disk whose format is ext4, it works now.

hadoop@ubuntu-1:~$ df -Th
Filesystem            Type      Size  Used Avail Use% Mounted on
/dev/sdb6             ext4      135G  8.6G  119G   7% /
udev                  devtmpfs  7.7G  4.0K  7.7G   1% /dev
tmpfs                 tmpfs     3.1G  316K  3.1G   1% /run
none                  tmpfs     5.0M     0  5.0M   0% /run/lock
none                  tmpfs     7.8G  4.0K  7.8G   1% /run/shm
/dev/sda1             ext4      112G  3.7G  103G   4% /faststore
/home/hadoop/.Private ecryptfs  135G  8.6G  119G   7% /home/hadoop

Thanks again, Marcelo Vanzin.

Francis.Hu

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Saturday, April 05, 2014 1:13
To: user@spark.apache.org
Subject: Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

Hi Francis, This might be a long shot, but do you happen to have built Spark in an encrypted home dir? (I was running into the same error when I was doing that. Rebuilding on an unencrypted disk fixed the issue. This is a known issue / limitation with ecryptfs. It's weird that the build doesn't fail, but you do get warnings about the long file names.)

On Wed, Apr 2, 2014 at 3:26 AM, Francis.Hu francis...@reachjunction.com wrote: I am stuck on a NoClassDefFoundError. Any help would be appreciated. I downloaded the Spark 0.9.0 source, and then ran this command to build it: SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply$5$$anonfun$scala$tools$nsc$transform$UnCurry$UnCurryTransformer$$anonfun$$anonfun$$transformInConstructor$1$1

-- Marcelo
Re: trouble with join on large RDDs
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I am running the latest version of PySpark branch-0.9 and having some trouble with join. One RDD is about 100G (25GB compressed and serialized in memory) with 130K records, the other RDD is about 10G (2.5G compressed and serialized in memory) with 330K records. I load each RDD from HDFS, invoke keyBy to key each record, and then attempt to join the RDDs. The join consistently crashes at the beginning of the reduce phase. Note that when joining the 10G RDD to itself there is no problem. Prior to the crash, several suspicious things happen:

- All map output files from the map phase of the join are written to spark.local.dir, even though there should be plenty of memory left to contain the map output. I am reasonably sure *all* map outputs are written to disk because the size of the map output spill directory matches the size of the shuffle write (as displayed in the user interface) for each machine.

The shuffle data is written through the buffer cache of the operating system, so you would expect the files to show up there immediately, and probably to show up as being their full size when you do ls. In reality, though, these are likely residing in the OS cache and not on disk.

- In the beginning of the reduce phase of the join, memory consumption on each worker spikes and each machine runs out of memory (as evidenced by a "Cannot allocate memory" exception in Java). This is particularly surprising since each machine has 30G of RAM and each Spark worker has only been allowed 10G.

Could you paste the error here?

- In the web UI, both the Shuffle Spill (Memory) and Shuffle Spill (Disk) fields for each machine remain at 0.0 despite shuffle files being written into spark.local.dir.

Shuffle spill is different than the shuffle files written to spark.local.dir. Shuffle spilling is for aggregations that occur on the reduce side of the shuffle. A join like this might not see any spilling.

From the logs it looks like your executor has died. Would you be able to paste the log from the executor with the exact failure? It would show up in the /work directory inside of Spark's directory on the cluster.
[BLOG] For Beginners
Hi all, Here I am sharing two blog posts for beginners, about creating a standalone Spark Streaming application in Scala and bundling the app as a single runnable jar. Take a look and drop your comments on the blog pages.

http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html
http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html

prabeesh
Mongo-Hadoop Connector with Spark
Hi Everyone, I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that GridFS collection data using a Java Spark MapReduce job. I have previously processed normal MongoDB collections (not GridFS) with Apache Spark using the mongo-hadoop connector, but now I'm unable to handle input GridFS collections. My questions are:

1) How do I pass MongoDB GridFS data as input to a Spark application?
2) Do we have a separate RDD to handle GridFS binary data?

I'm trying the following snippet, but I'm unable to get the actual data:

MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks");
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
    com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
JavaRDD<String> words = mongoRDD.flatMap(
    new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
      @Override
      public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
        System.out.println(arg._2.toString());
        ...

Please suggest available ways to do this. Thank you in advance!!!