Re: java.io.IOException: No space left on device
Makes sense. "/" is where /tmp would be. However, 230G should be plenty of space. If you have INFO logging turned on (set in $SPARK_HOME/conf/log4j.properties), you'll see messages about saving data to disk that will list sizes. The web console also has some summary information about this.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Apr 29, 2015 at 6:25 AM, selim namsi wrote:

> This is the output of df -h, so as you can see I'm using only one disk,
> mounted on /:
>
> df -h
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda8       276G   34G  229G  13% /
> none            4.0K     0  4.0K   0% /sys/fs/cgroup
> udev            7.8G  4.0K  7.8G   1% /dev
> tmpfs           1.6G  1.4M  1.6G   1% /run
> none            5.0M     0  5.0M   0% /run/lock
> none            7.8G   37M  7.8G   1% /run/shm
> none            100M   40K  100M   1% /run/user
> /dev/sda1       496M   55M  442M  11% /boot/efi
>
> Also when running the program, I noticed that the Used% disk space for
> the partition mounted on "/" was growing very fast.
>
> On Wed, Apr 29, 2015 at 12:19 PM Anshul Singhle wrote:
>
>> Do you have multiple disks? Maybe your work directory is not on the right
>> disk?
>>
>> On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi wrote:
>>
>>> Hi,
>>>
>>> I'm using Spark (1.3.1) MLlib to run the random forest algorithm on tfidf
>>> output; the training data is a file containing 156060 (size 8.1M).
>>>
>>> The problem is that when trying to persist a partition into memory and
>>> there is not enough memory, the partition is persisted on disk, and despite
>>> having 229G of free disk space, I got "No space left on device".
>>>
>>> This is how I'm running the program:
>>>
>>> ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
>>> local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
>>> testData.tsv
>>>
>>> And this is a part of the log:
>>>
>>> If you need more information, please let me know.
>>> Thanks
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
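[A one-line config change enables the disk-spill messages Dean mentions; a minimal sketch of the relevant log4j.properties entry, assuming the stock Spark layout where the file is created by copying the shipped template:]

```properties
# $SPARK_HOME/conf/log4j.properties
# (cp log4j.properties.template log4j.properties first, if the file is absent)
log4j.rootCategory=INFO, console
```

With INFO enabled, look for MemoryStore/BlockManager lines reporting blocks written to disk along with their sizes.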
Re: java.io.IOException: No space left on device
Or multiple volumes. The LOCAL_DIRS (YARN) and SPARK_LOCAL_DIRS (Mesos, Standalone) environment variables and the spark.local.dir property control where temporary data is written. The default is /tmp. See http://spark.apache.org/docs/latest/configuration.html#runtime-environment for more details. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Wed, Apr 29, 2015 at 6:19 AM, Anshul Singhle wrote: > Do you have multiple disks? Maybe your work directory is not in the right > disk? > > On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi > wrote: > >> Hi, >> >> I'm using spark (1.3.1) MLlib to run random forest algorithm on tfidf >> output,the training data is a file containing 156060 (size 8.1M). >> >> The problem is that when trying to presist a partition into memory and >> there >> is not enought memory, the partition is persisted on disk and despite >> Having >> 229G of free disk space, I got " No space left on device".. >> >> This is how I'm running the program : >> >> ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master >> local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv >> testData.tsv >> >> And this is a part of the log: >> >> >> >> If you need more informations, please let me know. >> Thanks >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
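[To act on the advice above, the scratch location can be moved off "/" with a single property; a sketch, where the path is purely illustrative and a comma-separated list of paths spreads I/O across several disks:]

```properties
# conf/spark-defaults.conf -- path is illustrative; any large local volume works
spark.local.dir  /data/spark-tmp
```

Equivalently, export SPARK_LOCAL_DIRS (Standalone, Mesos); on YARN, the cluster's LOCAL_DIRS takes precedence, as described on the linked configuration page.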
Re: Questions about Accumulators
Yes, correct. However, note that when an accumulator operation is *idempotent*, meaning that repeated application for the same data behaves exactly like one application, then that accumulator can be safely called in transformation steps (non-actions), too. For example, max and min tracking. Just last week I wrote one that used a hash map to track the latest timestamps seen for specific keys. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sun, May 3, 2015 at 8:07 AM, xiazhuchang wrote: > “For accumulator updates performed inside actions only, Spark guarantees > that > each task’s update to the accumulator will only be applied once, i.e. > restarted tasks will not update the value. In transformations, users should > be aware of that each task’s update may be applied more than once if tasks > or job stages are re-executed. ” > Is this mean the guarantees(accumulator only be updated once) only in > actions? That is to say, one should use the accumulator only in actions, > orelse there may be some errors(update more than once) if used in > transformations? > e.g. map(x => accumulator += x) > After executed, the correct result of accumulator should be "1"; > Unfortunately, some errors happened, restart task, the map() operation > re-executed(map(x => accumulator += x) re-executed), then the final result > of acculumator will be "2", twice as the correct result? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Accumulators-tp22746p22747.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
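[The idempotence argument can be illustrated without Spark at all. Below is a plain-Scala sketch of a latest-timestamp merge like the hash-map accumulator described above (key names invented), showing why a re-executed update changes nothing:]

```scala
// Track the latest timestamp seen per key. Re-applying the same
// observation (as a re-executed task would) leaves the map unchanged,
// which is what makes this kind of accumulator safe in transformations.
def merge(acc: Map[String, Long], key: String, ts: Long): Map[String, Long] =
  acc.updated(key, math.max(ts, acc.getOrElse(key, Long.MinValue)))

val once  = merge(Map.empty, "sensor-a", 1430650000L)
val twice = merge(once, "sensor-a", 1430650000L)  // same data, applied again
assert(once == twice)                             // idempotent: retry is a no-op

// Contrast with a sum: acc + x applied twice gives acc + 2x,
// hence the actions-only guarantee quoted from the docs above.
```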
Re: PriviledgedActionException- Executor error
So, it looks like the Akka remote call from your master node, where the CoarseGrainedExecutorBackend is running, to one or more slave nodes is timing out. By default, the port is 2552. Do you have a firewall between the nodes in your cluster that might be blocking this port? If you're not sure, try logging into your master and running the command "telnet slave_addr 2552", where "slave_addr" is one of the slaves' IP addresses or a routable host name.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 4:25 AM, podioss wrote:

> Hi,
> i am running several jobs in standalone mode and i notice this error in the
> log files on some of my nodes at the start of my jobs:
>
> INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> INFO spark.SecurityManager: Changing view acls to: root
> INFO spark.SecurityManager: Changing modify acls to: root
> INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions:
> INFO slf4j.Slf4jLogger: Slf4jLogger started
> INFO Remoting: Starting remoting
> ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>     at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
>     at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>     at scala.concurrent.Await$.result(package.scala:107)
>     at akka.remote.Remoting.start(Remoting.scala:180)
>     at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
>     at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
>     at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
>     at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
>     at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
>     at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
>     at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
>     at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
>     at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
>     at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
>     at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1676)
>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>     at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1667)
>     at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:122)
>     at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
>     at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
>     ... 7 more
> INFO remote.RemoteActorRefProvider$Re
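[If the telnet check passes and the startup handshake still times out, the Akka timeouts themselves can be raised. A config sketch using Spark 1.x property names (the values are illustrative, not recommendations):]

```properties
# conf/spark-defaults.conf -- illustrative values; Spark 1.x property names
spark.akka.timeout                        300
spark.core.connection.ack.wait.timeout    600
```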
Re: Spark distributed SQL: JSON Data set on all worker node
Note that each JSON object has to be on a single line in the files. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sun, May 3, 2015 at 4:14 AM, ayan guha wrote: > Yes it is possible. You need to use jsonfile method on SQL context and > then create a dataframe from the rdd. Then register it as a table. Should > be 3 lines of code, thanks to spark. > > You may see few YouTube video esp for unifying pipelines. > On 3 May 2015 19:02, "Jai" wrote: > >> Hi, >> >> I am noob to spark and related technology. >> >> i have JSON stored at same location on all worker clients spark cluster). >> I am looking to load JSON data set on these clients and do SQL query, like >> distributed SQL. >> >> is it possible to achieve? >> >> right now, master submits task to one node only. >> >> Thanks and regards >> Mrityunjay >> >
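[Concretely, a file that `jsonFile` can read looks like the sketch below: one self-contained object per line, with field names invented for illustration. A single pretty-printed object spanning several lines will instead show up as corrupt records.]

```json
{"id": 1, "name": "alice", "city": "Pune"}
{"id": 2, "name": "bob",   "city": "Oslo"}
```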
Re: Spark - Timeout Issues - OutOfMemoryError
How big is the data you're returning to the driver with collectAsMap? You are probably running out of memory trying to copy too much data back to it. If you're trying to force a map-side join, Spark can do that for you in some cases within the regular DataFrame/RDD context. See http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning and this talk by Michael Armbrust for example, http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Apr 30, 2015 at 12:40 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Full Exception > *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at > VISummaryDataProvider.scala:37) failed in 884.087 s* > *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap > at VISummaryDataProvider.scala:37, took 1093.418249 s* > 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw > exception: Job aborted due to stage failure: Exception while getting task > result: org.apache.spark.SparkException: Error sending message [message = > GetLocations(taskresult_112)] > org.apache.spark.SparkException: Job aborted due to stage failure: > Exception while getting task result: org.apache.spark.SparkException: Error > sending message [message = GetLocations(taskresult_112)] > at org.apache.spark.scheduler.DAGScheduler.org > $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED, > exitCode: 15, (reason: User class threw exception: Job aborted due to stage > failure: Exception while getting task result: > org.apache.spark.SparkException: Error sending message [message = > GetLocations(taskresult_112)]) > > > *Code at line 37* > > val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) } > .collectAsMap > > Listing data set size is 26G (10 files) and my driver memory is 12G (I > cant go beyond it). The reason i do collectAsMap is to brodcast it and do a > map-side join instead of regular join. > > > Please suggest ? 
> > > On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> My Spark Job is failing and i see >> >> == >> >> 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw >> exception: Job aborted due to stage failure: Exception while getting task >> result: org.apache.spark.SparkException: Error sending message [message = >> GetLocations(taskresult_112)] >> >> org.apache.spark.SparkException: Job aborted due to stage failure: >> Exception while getting task result: org.apache.spark.SparkException: Error >> sending message [message = GetLocations(taskresult_112)] >> >> at org.apache.spark.scheduler.DAGScheduler.org >> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) >> >> at >> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) >> >> at >> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) >> >> at >> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) >> >> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) >> >> at >> org.apache.spark.scheduler.DAGSchedul
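[The map-side (broadcast) join Dean describes can be sketched without Spark at all; plain Scala, with the small side held as an in-memory map. Record shapes and names are invented, loosely echoing the lstgItemMap in the thread:]

```scala
// Map-side join: the small table lives in memory on every worker,
// so each record of the big side is joined with a lookup
// instead of shuffling both sides across the network.
case class Listing(itemId: Long, title: String)
case class Event(itemId: Long, action: String)

val listings = Seq(Listing(1L, "bike"), Listing(2L, "lamp"))   // small side
val events   = Seq(Event(1L, "view"), Event(2L, "buy"), Event(3L, "view"))

// What collectAsMap builds on the driver; in Spark this map would be broadcast.
val lstgItemMap: Map[Long, Listing] = listings.map(l => l.itemId -> l).toMap

// Inner join by lookup: events whose itemId has no listing are dropped.
val joined = events.flatMap(e => lstgItemMap.get(e.itemId).map(l => (e, l)))
assert(joined.size == 2)  // the event for itemId 3 has no matching listing
```

The catch, as noted above, is that the small side must actually fit on the driver and in each executor, which is exactly where the 26G data set fails.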
Re: Spark - Timeout Issues - OutOfMemoryError
IMHO, you are trying waaay too hard to optimize work on what is really a small data set. 25G, even 250G, is not that much data, especially if you've spent a month trying to get something to work that should be simple. All these errors are from optimization attempts.

Kryo is great, but if it's not working reliably for some reason, then don't use it. Rather than force 200 partitions, let Spark try to figure out a good-enough number. (If you really need to force a partition count, use the repartition method instead, unless you're overriding the partitioner.)

So, I recommend that you eliminate all the optimizations: Kryo, partitionBy, etc. Just use the simplest code you can. Make it work first. Then, if it really isn't fast enough, look for actual evidence of bottlenecks and optimize those.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

> Hello Dean & Others,
> Thanks for your suggestions.
> I have two data sets and all i want to do is a simple equi join. I have a
> 10G limit and as my dataset_1 exceeded that it was throwing an OOM error.
> Hence i switched back to the .join() API instead of a map-side broadcast
> join.
> I am repartitioning the data with 100, 200 and i see a NullPointerException
> now.
>
> When i run against 25G of each input and with .partitionBy(new
> org.apache.spark.HashPartitioner(200)), I see a NullPointerException.
>
> This trace does not include a line from my code and hence i do not know
> what is causing the error.
> I do have a registered kryo serializer.
> > val conf = new SparkConf() > .setAppName(detail) > * .set("spark.serializer", > "org.apache.spark.serializer.KryoSerializer")* > .set("spark.kryoserializer.buffer.mb", > arguments.get("buffersize").get) > .set("spark.kryoserializer.buffer.max.mb", > arguments.get("maxbuffersize").get) > .set("spark.driver.maxResultSize", > arguments.get("maxResultSize").get) > .set("spark.yarn.maxAppAttempts", "0") > * > .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLeve* > lMetricSum])) > val sc = new SparkContext(conf) > > I see the exception when this task runs > > val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) } > > Its a simple mapping of input records to (itemId, record) > > I found this > > http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist > and > > http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html > > Looks like Kryo (2.21v) changed something to stop using default > constructors. 
> > (Kryo.DefaultInstantiatorStrategy) > kryo.getInstantiatorStrategy()).setFallbackInstantiatorStrategy(new > StdInstantiatorStrategy()); > > > Please suggest > > > Trace: > 15/05/01 03:02:15 ERROR executor.Executor: Exception in task 110.1 in > stage 2.0 (TID 774) > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > values (org.apache.avro.generic.GenericData$Record) > datum (org.apache.avro.mapred.AvroKey) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:41) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > Regards, > > > Any suggestions. > I am not able to get this thing to work over a month now, its kind of > getting frustrating. > > On Sun, May 3, 2015 at 8:03 PM, Dean Wampler > wrote: > >> How big is the data you're returning to the driver with collectAsMap? You >> are probably running out of memory trying to copy too much data back to it. >> >> If you're trying to force a map-side join, Spark can do that for you in >> some cases within the regular DataFrame/RDD context. Se
Re: Spark - Timeout Issues - OutOfMemoryError
I don't know the full context of what you're doing, but serialization errors usually mean you're attempting to serialize something that can't be serialized, like the SparkContext. Kryo won't help there.

The arguments to spark-submit you posted previously look good:

2) --num-executors 96 --driver-memory 12g --driver-java-options "-XX:MaxPermSize=10G" --executor-memory 12g --executor-cores 4

I suspect you aren't getting the parallelism you need. For partitioning, if your data is in HDFS and your block size is 128MB, then you'll get ~195 partitions anyway. If it takes 7 hours to do a join over 25GB of data, you have some other serious bottleneck. You should examine the web console and the logs to determine where all the time is going. Questions you might pursue:

- How long does each task take to complete?
- How many of those 195 partitions/tasks are processed at the same time? That is, how many "slots" are available? Maybe you need more nodes if the number of slots is too low. Based on your command arguments, you should be able to process 1/2 of them at a time, unless the cluster is busy.
- Is the cluster swamped with other work?
- How much data does each task process? Is the data roughly the same from one task to the next? If not, you might have serious key skew.

You may also need to research the details of how joins are implemented and some of the common tricks for organizing data to minimize having to shuffle all N by M records.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

> Hello Dean,
> If I don't use the Kryo serializer i get a serialization error and hence am
> using it.
> If I don';t use partitionBy/reparition then the simply join never > completed even after 7 hours and infact as next step i need to run it > against 250G as that is my full dataset size. Someone here suggested to me > to use repartition. > > Assuming reparition is mandatory , how do i decide whats the right number > ? When i am using 400 i do not get NullPointerException that i talked > about, which is strange. I never saw that exception against small random > dataset but see it with 25G and again with 400 partitions , i do not see it. > > > On Sun, May 3, 2015 at 9:15 PM, Dean Wampler > wrote: > >> IMHO, you are trying waaay to hard to optimize work on what is really a >> small data set. 25G, even 250G, is not that much data, especially if you've >> spent a month trying to get something to work that should be simple. All >> these errors are from optimization attempts. >> >> Kryo is great, but if it's not working reliably for some reason, then >> don't use it. Rather than force 200 partitions, let Spark try to figure out >> a good-enough number. (If you really need to force a partition count, use >> the repartition method instead, unless you're overriding the partitioner.) >> >> So. I recommend that you eliminate all the optimizations: Kryo, >> partitionBy, etc. Just use the simplest code you can. Make it work first. >> Then, if it really isn't fast enough, look for actual evidence of >> bottlenecks and optimize those. >> >> >> >> Dean Wampler, Ph.D. >> Author: Programming Scala, 2nd Edition >> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) >> Typesafe <http://typesafe.com> >> @deanwampler <http://twitter.com/deanwampler> >> http://polyglotprogramming.com >> >> On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) >> wrote: >> >>> Hello Dean & Others, >>> Thanks for your suggestions. >>> I have two data sets and all i want to do is a simple equi join. I have >>> 10G limit and as my dataset_1 exceeded that it was throwing OOM error. 
>>> Hence i switched back to use .join() API instead of map-side broadcast >>> join. >>> I am repartitioning the data with 100,200 and i see a >>> NullPointerException now. >>> >>> When i run against 25G of each input and with .partitionBy(new >>> org.apache.spark.HashPartitioner(200)) , I see NullPointerExveption >>> >>> >>> this trace does not include a line from my code and hence i do not what >>> is causing error ? >>> I do have registered kryo serializer. >>> >>> val conf = new SparkConf() >>> .setAppName(detail) >>> * .set("spark.serializer", >>> "org.apache.spark
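[The "~195 partitions" figure in Dean's reply is simple arithmetic worth making explicit; a quick sketch, assuming the common HDFS default block size of 128 MB and reading 25 GB as 25,000 MB:]

```scala
// HDFS splits the input into roughly (input size / block size) tasks.
val inputMB    = 25 * 1000     // 25 GB of input, read as 25,000 MB
val blockMB    = 128           // common HDFS default block size
val partitions = inputMB / blockMB
assert(partitions == 195)      // hence "~195 partitions" for the 25G data set
```

Forcing 100 or 400 partitions instead of ~195 mostly changes task granularity, not the total work, which is why Dean suggests letting Spark pick the number first.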
Re: value toDF is not a member of RDD object
It's the import statement Olivier showed that makes the method available. Note that you can also use `sqlContext.createDataFrame(myRDD)`, without the need for the import statement. I personally prefer this approach.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, May 12, 2015 at 9:33 AM, Olivier Girardot wrote:

> you need to instantiate a SQLContext:
> val sc: SparkContext = ...
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
>
> On Tue, May 12, 2015 at 12:29, SLiZn Liu wrote:
>
>> I added `libraryDependencies += "org.apache.spark" % "spark-sql_2.11" %
>> "1.3.1"` to `build.sbt` but the error remains. Do I need to import modules
>> other than `import org.apache.spark.sql.{ Row, SQLContext }`?
>>
>> On Tue, May 12, 2015 at 5:56 PM Olivier Girardot wrote:
>>
>>> toDF is part of Spark SQL, so you need the Spark SQL dependency + import
>>> sqlContext.implicits._ to get the toDF method.
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>> On Tue, May 12, 2015 at 11:36, SLiZn Liu wrote:
>>>
>>>> Hi User Group,
>>>>
>>>> I'm trying to reproduce the example in the Spark SQL Programming Guide
>>>> <https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection>,
>>>> and got a compile error when packaging with sbt:
>>>>
>>>> [error] myfile.scala:30: value toDF is not a member of
>>>> org.apache.spark.rdd.RDD[Person]
>>>> [error] val people =
>>>> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p
>>>> => Person(p(0), p(1).trim.toInt)).toDF()
>>>> [error]
>>>> [error] one error found
>>>> [error] (compile:compileIncremental) Compilation failed
>>>> [error] Total time: 3 s, completed May 12, 2015 4:11:53 PM
>>>>
>>>> I double-checked that my code includes import sqlContext.implicits._
>>>> after reading this post
>>>> <https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3c1426522113299-22083.p...@n3.nabble.com%3E>
>>>> on the spark mailing list, and even tried toDF("col1", "col2")
>>>> as suggested by Xiangrui Meng in that post, and got the same error.
>>>>
>>>> The Spark version is specified in the build.sbt file as follows:
>>>>
>>>> scalaVersion := "2.11.6"
>>>> libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.3.1" % "provided"
>>>> libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "1.3.1"
>>>>
>>>> Anyone have ideas about the cause of this error?
>>>>
>>>> REGARDS,
>>>> Todd Leo
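[Pulling the thread's suggestions together: the build needs the spark-sql module, and the implicits must be imported from a SQLContext *instance*, not from the org.apache.spark.sql package. A sketch of both pieces (not compiled here; versions taken from the thread):]

```scala
// build.sbt -- spark-sql supplies SQLContext and the toDF implicits;
// %% appends the Scala binary version (_2.11) automatically
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"

// myfile.scala -- `import org.apache.spark.sql.SQLContext` alone is not enough;
// the implicits live on the instance:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
```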
Re: Spark SQL: preferred syntax for column reference?
Are $"foo" and mydf("foo") checked at compile time to verify that the column reference is valid?

Thx.
Dean

On Wednesday, May 13, 2015, Michael Armbrust wrote:

> I would not say that either method is preferred (neither is
> old/deprecated). One advantage to the second is that you are referencing a
> column from a specific dataframe, instead of just providing a string that
> will be resolved much like an identifier in a SQL query.
>
> This means given:
> df1 = [id: int, name: string]
> df2 = [id: int, zip: int]
>
> I can do something like:
>
> df1.join(df2, df1("id") === df2("id"))
>
> Whereas I would need aliases if I was only using strings:
>
> df1.as("a").join(df2.as("b"), $"a.id" === $"b.id")
>
> On Wed, May 13, 2015 at 9:55 AM, Diana Carroll wrote:
>
>> I'm just getting started with Spark SQL and DataFrames in 1.3.0.
>>
>> I notice that the Spark API shows a different syntax for referencing
>> columns in a dataframe than the Spark SQL Programming Guide.
>>
>> For instance, the API docs for the select method show this:
>> df.select($"colA", $"colB")
>>
>> Whereas the programming guide shows this:
>> df.filter(df("name") > 21).show()
>>
>> I tested and both the $"column" and df(column) syntax work, but I'm
>> wondering which is *preferred*. Is one the original and one a new
>> feature we should be using?
>>
>> Thanks,
>> Diana
>> (Spark Curriculum Developer for Cloudera)

--
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
Re: Spark vs Google cloud dataflow
... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing. On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen wrote: > On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia > wrote: > > Summingbird is for map/reduce. Dataflow is the third generation of > google's > > map/reduce, and it generalizes map/reduce the way Spark does. See more > about > > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s > > Yes, my point was that Summingbird is similar in that it is a > higher-level service for batch/streaming computation, not that it is > similar for being MapReduce-based. > > > It seems Dataflow is based on this paper: > > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf > > FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is > more than that but yeah that seems to be some of the 'language'. It is > similar in that it is a distributed collection abstraction. > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Recommended pipeline automation tool? Oozie?
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie On Thu, Jul 10, 2014 at 5:52 PM, Andrei wrote: > I used both - Oozie and Luigi - but found them inflexible and still > overcomplicated, especially in presence of Spark. > > Oozie has a fixed list of building blocks, which is pretty limiting. For > example, you can launch Hive query, but Impala, Shark/SparkSQL, etc. are > out of scope (of course, you can always write wrapper as Java or Shell > action, but does it really need to be so complicated?). Another issue with > Oozie is passing variables between actions. There's Oozie context that is > suitable for passing key-value pairs (both strings) between actions, but > for more complex objects (say, FileInputStream that should be closed at > last step only) you have to do some advanced kung fu. > > Luigi, on other hand, has its niche - complicated dataflows with many > tasks that depend on each other. Basically, there are tasks (this is where > you define computations) and targets (something that can "exist" - file on > disk, entry in ZooKeeper, etc.). You ask Luigi to get some target, and it > creates a plan for achieving this. Luigi is really shiny when your workflow > fits this model, but one step away and you are in trouble. For example, > consider simple pipeline: run MR job and output temporary data, run another > MR job and output final data, clean temporary data. You can make target > Clean, that depends on target MRJob2 that, in its turn, depends on MRJob1, > right? Not so easy. How do you check that Clean task is achieved? If you > just test whether temporary directory is empty or not, you catch both cases > - when all tasks are done and when they are not even started yet. Luigi > allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single > "run()" method, but ruins the entire idea. 
> > And of course, both of these frameworks are optimized for standard > MapReduce jobs, which is probably not what you want on a Spark mailing list > :) > > Experience with these frameworks, however, gave me some insights about > typical data pipelines. > > 1. Pipelines are mostly linear. Oozie, Luigi, and a number of other > frameworks allow branching, but most pipelines actually consist of moving > data from a source to a destination with possibly some transformations in > between (I'll be glad if somebody shares use cases where you really need > branching). > 2. Transactional logic is important. Either everything, or nothing. > Otherwise it's really easy to get into an inconsistent state. > 3. Extensibility is important. You never know what you will need in a week or > two. > > So eventually I decided that it is much easier to create your own pipeline > than to try to adapt your code to existing frameworks. My latest > pipeline incarnation simply consists of a list of steps that are started > sequentially. Each step is a class with at least these methods: > > * run() - launch this step > * fail() - what to do if the step fails > * finalize() - (optional) what to do when all steps are done > > For example, if you want to add the ability to run Spark jobs, you just > create a SparkStep and configure it with the required code. If you want a Hive > query - just create a HiveStep and configure it with Hive connection > settings. I use a YAML file to configure the steps and a Context (basically, a > Map[String, Any]) to pass variables between them. I also use a configurable > Reporter, available to all steps, to report progress. > > Hopefully, this will give you some insights about the best pipeline for your > specific case. > > > > On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown wrote: > >> >> We use Luigi for this purpose. (Our pipelines are typically on AWS (no >> EMR) backed by S3 and using combinations of Python jobs, non-Spark >> Java/Scala, and Spark. 
We run Spark jobs by connecting drivers/clients to >> the master, and those are what is invoked from Luigi.) >> >> — >> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ >> >> >> On Thu, Jul 10, 2014 at 10:20 AM, k.tham wrote: >> >>> I'm just wondering what's the general recommendation for data pipeline >>> automation. >>> >>> Say, I want to run Spark Job A, then B, then invoke script C, then do D, >>> and >>> if D fails, do E, and if Job A fails, send email F, etc... >>> >>> It looks like Oozie might be the best choice. But I'd like some >>> advice/suggestions. >>> >>> Thanks! >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >> >> > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
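[Editor's sketch] Andrei's step-based pipeline above can be captured in a few lines of Scala. The names (`Step`, `Context`, `runPipeline`) follow his description, but the details are illustrative guesses, not his actual code; `finalize()` is renamed `finalizeStep` to avoid clashing with `java.lang.Object.finalize`:

```scala
import scala.collection.mutable

// Context passes values between steps (his Map[String, Any]).
class Context {
  private val data = mutable.Map.empty[String, Any]
  def get(key: String): Option[Any] = data.get(key)
  def put(key: String, value: Any): Unit = data(key) = value
}

trait Step {
  def run(ctx: Context): Unit                       // launch this step
  def fail(ctx: Context, e: Throwable): Unit = ()   // what to do if the step fails
  def finalizeStep(ctx: Context): Unit = ()         // optional: runs after all steps finish
}

// Run steps sequentially; on the first failure, call fail() and stop.
def runPipeline(steps: Seq[Step], ctx: Context): Boolean = {
  val ok = steps.forall { step =>
    try { step.run(ctx); true }
    catch { case e: Throwable => step.fail(ctx, e); false }
  }
  steps.foreach(_.finalizeStep(ctx))
  ok
}
```

A `SparkStep` or `HiveStep` would then just be a `Step` subclass configured (e.g. from YAML) with the job code or connection settings it needs.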
Re: Issue with Spark on EC2 using spark-ec2 script
The stack trace suggests it was trying to create a temporary file, not read your file. Of course, it doesn't say what file it couldn't create. Could there be a configuration file, like a Hadoop config file, that was read with a temp dir setting that's invalid for your machine? dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Jul 31, 2014 at 4:04 PM, Ryan Tabora wrote: > Hey all, > > I was able to spawn up a cluster, but when I'm trying to submit a simple > jar via spark-submit it fails to run. I am trying to run the simple > "Standalone Application" from the quickstart. > > Oddly enough, I could get another application running through the > spark-shell. What am I doing wrong here? :( > > http://spark.apache.org/docs/latest/quick-start.html > > * Here's my setup: * > > $ ls > project simple.sbt src target > > $ ls -R src > src: > main > > src/main: > scala > > src/main/scala: > SimpleApp.scala > > $ cat src/main/scala/SimpleApp.scala > package main.scala > > /* SimpleApp.scala */ > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > import org.apache.spark.SparkConf > > object SimpleApp { > def main(args: Array[String]) { > val logFile = "/tmp/README.md" > val conf = new SparkConf().setAppName("Simple Application") > val sc = new SparkContext(conf) > val logData = sc.textFile(logFile, 2).cache() > val numAs = logData.filter(line => > line.contains("a")).count() > val numBs = logData.filter(line => > line.contains("b")).count() > println("Lines with a: %s, Lines with b: > %s".format(numAs, numBs)) > } > } > > $ cat simple.sbt > name := "Simple Project" > > version := "1.0" > > scalaVersion := "2.10.4" > > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" > > resolvers += "Akka Repository" at "http://repo.akka.io/releases/"; > > * 
Here's how I run the job: * > > $ /root/spark/bin/spark-submit --class "main.scala.SimpleApp" --master > local[4] ./target/scala-2.10/simple-project_2.10-1.0.jar > > *Here is the error: * > > 14/07/31 16:23:56 INFO scheduler.DAGScheduler: Failed to run count at > SimpleApp.scala:14 > Exception in thread "main" org.apache.spark.SparkException: Job aborted > due to stage failure: Task 0.0:1 failed 1 times, most recent failure: > Exception failure in TID 1 on host localhost: java.io.IOException: No such > file or directory > java.io.UnixFileSystem.createFileExclusively(Native Method) > java.io.File.createNewFile(File.java:1006) > java.io.File.createTempFile(File.java:1989) > org.apache.spark.util.Utils$.fetchFile(Utils.scala:326) > > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:332) > > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.executor.Executor.org > $apache$spark$executor$Executor$$updateDependencies(Executor.scala:330) > > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:168) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at 
org.apache.spark.scheduler.DAGScheduler.org > $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) >
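[Editor's note] Following Dean's temp-file hypothesis, one thing worth checking — a guess, not a confirmed fix — is where Spark writes its scratch files. The directory below is an example only; substitute one that exists and is writable on your nodes:

```shell
# Point Spark's scratch space at a known-good directory (example path only).
export SPARK_LOCAL_DIRS=/mnt/spark

# Or, equivalently, in conf/spark-defaults.conf:
#   spark.local.dir   /mnt/spark
```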
Re: Issue with Spark on EC2 using spark-ec2 script
Forgot to add that I tried your program with the same input file path. It worked fine. (I used local[2], however...) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Jul 31, 2014 at 5:07 PM, Dean Wampler wrote: > The stack trace suggests it was trying to create a temporary file, not > read your file. Of course, it doesn't say what file it couldn't create. > > Could there be a configuration file, like a Hadoop config file, that was > read with a temp dir setting that's invalid for your machine? > > dean > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition > <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) > Typesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > > On Thu, Jul 31, 2014 at 4:04 PM, Ryan Tabora wrote: > >> Hey all, >> >> I was able to spawn up a cluster, but when I'm trying to submit a simple >> jar via spark-submit it fails to run. I am trying to run the simple >> "Standalone Application" from the quickstart. >> >> Oddly enough, I could get another application running through the >> spark-shell. What am I doing wrong here? 
:( >> >> http://spark.apache.org/docs/latest/quick-start.html >> >> * Here's my setup: * >> >> $ ls >> project simple.sbt src target >> >> $ ls -R src >> src: >> main >> >> src/main: >> scala >> >> src/main/scala: >> SimpleApp.scala >> >> $ cat src/main/scala/SimpleApp.scala >> package main.scala >> >> /* SimpleApp.scala */ >> import org.apache.spark.SparkContext >> import org.apache.spark.SparkContext._ >> import org.apache.spark.SparkConf >> >> object SimpleApp { >> def main(args: Array[String]) { >> val logFile = "/tmp/README.md" >> val conf = new SparkConf().setAppName("Simple Application") >> val sc = new SparkContext(conf) >> val logData = sc.textFile(logFile, 2).cache() >> val numAs = logData.filter(line => >> line.contains("a")).count() >> val numBs = logData.filter(line => >> line.contains("b")).count() >> println("Lines with a: %s, Lines with >> b: %s".format(numAs, numBs)) >> } >> } >> >> $ cat simple.sbt >> name := "Simple Project" >> >> version := "1.0" >> >> scalaVersion := "2.10.4" >> >> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" >> >> resolvers += "Akka Repository" at "http://repo.akka.io/releases/"; >> >> * Here's how I run the job: * >> >> $ /root/spark/bin/spark-submit --class "main.scala.SimpleApp" --master >> local[4] ./target/scala-2.10/simple-project_2.10-1.0.jar >> >> *Here is the error: * >> >> 14/07/31 16:23:56 INFO scheduler.DAGScheduler: Failed to run count at >> SimpleApp.scala:14 >> Exception in thread "main" org.apache.spark.SparkException: Job aborted >> due to stage failure: Task 0.0:1 failed 1 times, most recent failure: >> Exception failure in TID 1 on host localhost: java.io.IOException: No such >> file or directory >> java.io.UnixFileSystem.createFileExclusively(Native Method) >> java.io.File.createNewFile(File.java:1006) >> java.io.File.createTempFile(File.java:1989) >> org.apache.spark.util.Utils$.fetchFile(Utils.scala:326) >> >> 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:332) >> >> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330) >> >> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) >> >> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) >> >> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) >> >> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) >> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) >> scala.
Re: Issue with Spark on EC2 using spark-ec2 script
It looked like you were running in local mode (master set to local[4]). That's how I ran it. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Thu, Jul 31, 2014 at 8:37 PM, ratabora wrote: > Hey Dean! Thanks! > > Did you try running this on a local environment or one generated by the > spark-ec2 script? > > The environment I am running on is a 4 data node 1 master spark cluster > generated by the spark-ec2 script. I haven't modified anything in the > environment except for adding data to the ephemeral hdfs. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-Spark-on-EC2-using-spark-ec2-script-tp11088p7.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. >
Re: Dependency Problem with Spark / ScalaTest / SBT
Can you post your whole SBT build file(s)? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler wrote: > Hi, > > I just called: > > > test > > or > > > run > > Thorsten > > > Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com: > > Hi, >> >> What is your SBT command and the parameters? >> >> Arthur >> >> >> On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler wrote: >> >> Hello, >>> >>> I am writing a Spark App which is already working so far. >>> Now I have also started to build some unit tests, but I am running into some >>> dependency problems and I cannot find a solution right now. Perhaps someone >>> could help me. >>> >>> I build my Spark Project with SBT and it seems to be configured well, >>> because compiling, assembling and running the built jar with spark-submit >>> are working well. >>> >>> Now I started with the unit tests, which I located under /src/test/scala. >>> >>> When I call "test" in sbt, I get the following: >>> >>> 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered >>> BlockManager >>> 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server >>> [trace] Stack trace suppressed: run last test:test for the full output. >>> [error] Could not run test test.scala.SetSuite: >>> java.lang.NoClassDefFoundError: >>> javax/servlet/http/HttpServletResponse >>> [info] Run completed in 626 milliseconds. >>> [info] Total number of tests run: 0 >>> [info] Suites: completed 0, aborted 0 >>> [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 >>> [info] All tests passed. 
>>> [error] Error during tests: >>> [error] test.scala.SetSuite >>> [error] (test:test) sbt.TestsFailedException: Tests unsuccessful >>> [error] Total time: 3 s, completed 10.09.2014 12:22:06 >>> >>> last test:test gives me the following: >>> >>> last test:test >>>> >>> [debug] Running TaskDef(test.scala.SetSuite, >>> org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector]) >>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse >>> at org.apache.spark.HttpServer.start(HttpServer.scala:54) >>> at org.apache.spark.broadcast.HttpBroadcast$.createServer( >>> HttpBroadcast.scala:156) >>> at org.apache.spark.broadcast.HttpBroadcast$.initialize( >>> HttpBroadcast.scala:127) >>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize( >>> HttpBroadcastFactory.scala:31) >>> at org.apache.spark.broadcast.BroadcastManager.initialize( >>> BroadcastManager.scala:48) >>> at org.apache.spark.broadcast.BroadcastManager.( >>> BroadcastManager.scala:35) >>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218) >>> at org.apache.spark.SparkContext.(SparkContext.scala:202) >>> at test.scala.SetSuite.(SparkTest.scala:16) >>> >>> I also noticed right now, that sbt run is also not working: >>> >>> 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server >>> [error] (run-main-2) java.lang.NoClassDefFoundError: javax/servlet/http/ >>> HttpServletResponse >>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse >>> at org.apache.spark.HttpServer.start(HttpServer.scala:54) >>> at org.apache.spark.broadcast.HttpBroadcast$.createServer( >>> HttpBroadcast.scala:156) >>> at org.apache.spark.broadcast.HttpBroadcast$.initialize( >>> HttpBroadcast.scala:127) >>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize( >>> HttpBroadcastFactory.scala:31) >>> at org.apache.spark.broadcast.BroadcastManager.initialize( >>> BroadcastManager.scala:48) >>> at org.apache.spark.broadcast.BroadcastManager.( >>> BroadcastManager.scala:35) 
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218) >>> at org.apache.spark.SparkContext.(SparkContext.scala:202) >>> at main.scala.PartialDuplicateScanner$.main( >>> PartialDuplicateScanner.scala:29) >>> at main.scala.PartialDuplicateScanner.main( >>> PartialDuplicateScanner.scala) >>> >>&g
Re: Dependency Problem with Spark / ScalaTest / SBT
Sorry, I meant any *other* SBT files. However, what happens if you remove the line: exclude("org.eclipse.jetty.orbit", "javax.servlet") dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sun, Sep 14, 2014 at 11:53 AM, Dean Wampler wrote: > Can you post your whole SBT build file(s)? > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition > <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) > Typesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler > wrote: > >> Hi, >> >> I just called: >> >> > test >> >> or >> >> > run >> >> Thorsten >> >> >> Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com: >> >> Hi, >>> >>> What is your SBT command and the parameters? >>> >>> Arthur >>> >>> >>> On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler wrote: >>> >>> Hello, >>>> >>>> I am writing a Spark App which is already working so far. >>>> Now I have also started to build some unit tests, but I am running into some >>>> dependency problems and I cannot find a solution right now. Perhaps someone >>>> could help me. >>>> >>>> I build my Spark Project with SBT and it seems to be configured well, >>>> because compiling, assembling and running the built jar with spark-submit >>>> are working well. >>>> >>>> Now I started with the unit tests, which I located under /src/test/scala. >>>> >>>> When I call "test" in sbt, I get the following: >>>> >>>> 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered >>>> BlockManager >>>> 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server >>>> [trace] Stack trace suppressed: run last test:test for the full output. 
>>>> [error] Could not run test test.scala.SetSuite: >>>> java.lang.NoClassDefFoundError: >>>> javax/servlet/http/HttpServletResponse >>>> [info] Run completed in 626 milliseconds. >>>> [info] Total number of tests run: 0 >>>> [info] Suites: completed 0, aborted 0 >>>> [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 >>>> [info] All tests passed. >>>> [error] Error during tests: >>>> [error] test.scala.SetSuite >>>> [error] (test:test) sbt.TestsFailedException: Tests unsuccessful >>>> [error] Total time: 3 s, completed 10.09.2014 12:22:06 >>>> >>>> last test:test gives me the following: >>>> >>>> last test:test >>>>> >>>> [debug] Running TaskDef(test.scala.SetSuite, >>>> org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector]) >>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse >>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54) >>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer( >>>> HttpBroadcast.scala:156) >>>> at org.apache.spark.broadcast.HttpBroadcast$.initialize( >>>> HttpBroadcast.scala:127) >>>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize( >>>> HttpBroadcastFactory.scala:31) >>>> at org.apache.spark.broadcast.BroadcastManager.initialize( >>>> BroadcastManager.scala:48) >>>> at org.apache.spark.broadcast.BroadcastManager.( >>>> BroadcastManager.scala:35) >>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218) >>>> at org.apache.spark.SparkContext.(SparkContext.scala:202) >>>> at test.scala.SetSuite.(SparkTest.scala:16) >>>> >>>> I also noticed right now, that sbt run is also not working: >>>> >>>> 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server >>>> [error] (run-main-2) java.lang.NoClassDefFoundError: >>>> javax/servlet/http/HttpServletResponse >>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse >>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54) >>>> at 
org.apache.spark.broadcast.HttpBroadcast$.createServer( >&
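[Editor's sketch] If removing the `exclude` Dean points at isn't an option, another build-level fix is to put the servlet API back on the classpath explicitly. This is a guess at a fix, not Thorsten's actual build file; the artifact versions are illustrative and may need adjusting:

```scala
// build.sbt sketch: either drop exclude("org.eclipse.jetty.orbit", "javax.servlet")
// from the spark-core dependency, or add the servlet API back explicitly so
// HttpServletResponse resolves when sbt runs tests in its own JVM.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.2",
  "javax.servlet" % "javax.servlet-api" % "3.0.1" % "test",
  "org.scalatest" %% "scalatest" % "2.2.1" % "test"
)
```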
Re: SparkSQL Thriftserver in Mesos
The Mesos install guide says this: "To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos." For example, putting it in HDFS or copying it to each node in the same location should do the trick. https://spark.apache.org/docs/latest/running-on-mesos.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Mon, Sep 22, 2014 at 2:35 PM, John Omernik wrote: > Any thoughts on this? > > On Sat, Sep 20, 2014 at 12:16 PM, John Omernik wrote: > >> I am running the Thrift server in SparkSQL, and running it on the node I >> compiled spark on. When I run it, tasks only work if they landed on that >> node, other executors started on nodes I didn't compile spark on (and thus >> don't have the compile directory) fail. Should spark be distributed >> properly with the executor uri in my spark-defaults for mesos? >> >> Here is the error on nodes with Lost executors >> >> sh: 1: /opt/mapr/spark/spark-1.1.0-SNAPSHOT/sbin/spark-executor: not found >> >> >> >
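[Editor's note] Concretely, the recipe from that page boils down to two steps; the path and version below are examples only:

```shell
# 1. Make the Spark binary package reachable from every Mesos slave,
#    e.g. by putting it in HDFS (example path/version):
hadoop fs -put spark-1.1.0-bin.tgz /spark/spark-1.1.0-bin.tgz

# 2. Tell the driver (here, the Thrift server) where executors can fetch it,
#    in conf/spark-defaults.conf:
#      spark.executor.uri   hdfs:///spark/spark-1.1.0-bin.tgz
```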
Re: scala Vector vs mllib Vector
Briefly, MLlib's Vector and its concrete subclasses DenseVector and SparseVector wrap Java arrays, which are mutable and maximize memory efficiency. To update one of these vectors, you mutate the elements of the underlying array. That's great for performance, but dangerous in multithreaded programs for all the usual reasons. Scala's Vector is a *persistent data structure* (best to google that term...), with effectively constant-time operations, but a higher constant factor. Scala Vector instances are immutable, so mutating operations return a new Vector, but the "persistent" implementation uses structure sharing (of the unchanged parts) to make efficient copies. Also, Scala Vector isn't designed to represent sparse vectors. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sat, Oct 4, 2014 at 1:44 AM, ll wrote: > what are the pros/cons of each? when should we use mllib Vector, and when > to > use standard scala Vector? thanks. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/scala-Vector-vs-mllib-Vector-tp15736.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
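[Editor's sketch] The contrast Dean describes can be seen in a few lines of plain Scala, with an `Array` standing in for the array that backs MLlib's vectors:

```scala
// Array-backed (what MLlib's vectors wrap): update in place.
// Fast, but shared mutable state -- dangerous across threads.
val arr = Array(1.0, 2.0, 3.0)
arr(1) = 20.0                    // mutates the array itself

// scala.collection.immutable.Vector: "updates" return a new structure
// that shares the unchanged parts with the old one.
val v1 = Vector(1.0, 2.0, 3.0)
val v2 = v1.updated(1, 20.0)     // v1 is untouched; v2 shares structure with it
```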
Re: scala Vector vs mllib Vector
Spark isolates each task, so I would use the MLlib vector. I didn't mention this, but it also integrates with Breeze, a Scala mathematics library that you might find useful. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com On Sat, Oct 4, 2014 at 8:52 AM, ll wrote: > thanks dean. thanks for the answer with great clarity! > > i'm working on an algorithm that has a weight vector W(w0, w1, .., wN). > the > elements of this weight vector are adjusted/updated frequently - every > iteration of the algorithm. how would you recommend to implement this > vector? what is the best practice to implement this in Scala & Spark? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/scala-Vector-vs-mllib-Vector-tp15736p15741.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
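[Editor's sketch] A minimal version of the frequently updated weight vector, in the mutable, array-backed style Dean recommends. A plain `Array[Double]` stands in for MLlib's DenseVector (which wraps one); the gradient values and learning rate are made up for illustration:

```scala
// Weight vector W(w0, w1, w2), updated in place every iteration --
// no per-iteration copies, unlike an immutable Scala Vector.
val w = Array(0.0, 0.0, 0.0)
val gradient = Array(0.5, -1.0, 2.0)  // pretend this came from one iteration
val lr = 0.1                          // learning rate (illustrative)

var i = 0
while (i < w.length) {
  w(i) -= lr * gradient(i)            // in-place gradient step
  i += 1
}
```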
Re: Spark Project Fails to run multicore in local mode.
Iterator$class.foreach(Iterator.scala:727) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at > org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765) at > org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765) at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at > org.apache.spark.scheduler.Task.run(Task.scala:56) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > -- > View this message in context: Spark Project Fails to run multicore in > local mode. > <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Project-Fails-to-run-multicore-in-local-mode-tp21034.html> > Sent from the Apache Spark User List mailing list archive > <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com. > -- Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) Typesafe <http://typesafe.com> @deanwampler <http://twitter.com/deanwampler> http://polyglotprogramming.com
Re: Spark - ready for prime time?
ting more and more > cached. This is all fine and good, but in the meanwhile I see stages > running on the UI that point to code which is used to load A and B. How is > this possible? Am I misunderstanding how cached RDDs should behave? > > And again the general question - how can one debug such issues? > > 4. Shuffle on disk > Is it true - I couldn't find it in official docs, but did see this > mentioned in various threads - that shuffle _always_ hits disk? > (Disregarding OS caches.) Why is this the case? Are you planning to add a > function to do shuffle in memory or are there some intrinsic reasons for > this to be impossible? > > > Sorry again for the giant mail, and thanks for any insights! > > Andras > > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Spark - ready for prime time?
Here are several good ones: https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8 On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira wrote: > Do you have the link to the Cloudera comment? > > Sent from Windows Mail > > *From:* Dean Wampler > *Sent:* Thursday, April 10, 2014 7:39 AM > *To:* Spark Users > *Cc:* Daniel Darabos , Andras > Barjak > > Spark has been endorsed by Cloudera as the successor to MapReduce. That > says a lot... > > > On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth < > andras.nem...@lynxanalytics.com> wrote: > >> Hello Spark Users, >> >> With the recent graduation of Spark to a top level project (grats, btw!), >> maybe a well timed question. :) >> >> We are at the very beginning of a large scale big data project and after >> two months of exploration work we'd like to settle on the technologies to >> use, roll up our sleeves and start to build the system. >> >> Spark is one of the forerunners for our technology choice. >> >> My question in essence is whether it's a good idea or is Spark too >> 'experimental' just yet to bet our lives (well, the project's life) on it. >> >> The benefits of choosing Spark are numerous and I guess all too obvious >> for this audience - e.g. we love its powerful abstraction, ease of >> development and the potential for using a single system for serving and >> manipulating huge amount of data. >> >> This email aims to ask about the risks. I enlist concrete issues we've >> encountered below, but basically my concern boils down to two philosophical >> points: >> I. Is it too much magic? Lots of things "just work right" in Spark and >> it's extremely convenient and efficient when it indeed works. But should we >> be worried that customization is hard if the built in behavior is not quite >> right for us? Are we to expect hard to track down issues originating from >> the black box behind the magic? >> II. Is it mature enough? E.g. 
we've created a pull >> request <https://github.com/apache/spark/pull/181> which fixes a problem that >> we were very surprised no one ever stumbled upon >> before. So that's why I'm asking: is Spark already being used in >> professional settings? Can one already trust it to be reasonably bug-free >> and reliable? >> >> I know I'm asking a biased audience, but that's fine, as I want to be >> convinced. :) >> >> So, to the concrete issues. Sorry for the long mail, and let me know if I >> should break this out into more threads or if there is some other way to >> have this discussion... >> >> 1. Memory management >> The general direction of these questions is whether it's possible to take >> RDD-caching-related memory management more into our own hands, as LRU >> eviction is nice most of the time but can be very suboptimal in some of our >> use cases. >> A. Somehow prioritize cached RDDs, e.g. mark some as "essential" that one >> really wants to keep. I'm fine with going down in flames if I mark too much >> data essential. >> B. Memory "reflection": can you programmatically get the memory size of a >> cached rdd and the memory sizes available in total/per executor? If we could do >> this we could indirectly avoid automatic evictions of things we might >> really want to keep in memory. >> C. Evictions caused by RDD partitions on the driver. I had a setup with >> huge worker memory and smallish memory on the driver JVM. To my surprise, >> the system started to cache RDD partitions on the driver as well. As the >> driver ran out of memory I started to see evictions while there was still >> plenty of space on the workers. This resulted in lengthy recomputations. Can >> this be avoided somehow? >> D. Broadcasts. Is it possible to get rid of a broadcast manually, without >> waiting for the LRU eviction taking care of it? Can you tell the size of a >> broadcast programmatically? >> >> >> 2. 
Akka lost connections >> We have quite often experienced lost executors due to akka exceptions - >> mostly connection lost or similar. It seems to happen when an executor gets >> extremely busy with some CPU intensive work. Our hypothesis is that akka >> network threads get starved and the executor fails to respond within >> timeout limits. Is this plausible? If yes, what can we do with it? >> >> In general, these are scary errors in the sense that they come from t
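[Editor's note] On question 1.D (and manual cache control generally): Spark does expose explicit release calls, so LRU eviction isn't the only option. A sketch, assuming a live SparkContext `sc`; the path is illustrative:

```scala
// Manual release of cached data, rather than waiting for LRU eviction.
// Assumes an existing SparkContext `sc`; the input path is made up.
val rdd = sc.textFile("hdfs:///example/path").cache()
rdd.count()                       // materializes the cache
rdd.unpersist(blocking = true)    // drop the cached partitions now

val bc = sc.broadcast(Map("k" -> 1))
// ... use bc.value inside tasks ...
bc.unpersist()                    // remove executor-side copies of the broadcast
```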
Re: Using Spark for Divide-and-Conquer Algorithms
There is a handy parallelize method for running independent computations. The examples page (http://spark.apache.org/examples.html) on the website uses it to estimate Pi. You can join the results at the end of the parallel calculations. On Fri, Apr 11, 2014 at 7:52 AM, Yanzhe Chen wrote: > Hi all, > > Is Spark suitable for applications like Convex Hull algorithm, which has > some classic divide-and-conquer approaches like QuickHull? > > More generally, Is there a way to express divide-and-conquer algorithms in > Spark? > > Thanks! > > -- > Yanzhe Chen > Institute of Parallel and Distributed Systems > Shanghai Jiao Tong University > Email: yanzhe...@gmail.com > Sent with Sparrow <http://www.sparrowmailapp.com/?sig> > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
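[Editor's note] The Pi estimation Dean refers to looks roughly like this in Scala (essentially the snippet from the linked examples page; assumes an existing SparkContext `sc`):

```scala
// Monte Carlo estimate of Pi from independent parallel samples
// (assumes an existing SparkContext `sc`).
val numSamples = 100000
val count = sc.parallelize(1 to numSamples).filter { _ =>
  val x = math.random
  val y = math.random
  x * x + y * y < 1        // did the random point land inside the unit circle?
}.count()
println(s"Pi is roughly ${4.0 * count / numSamples}")
```

The same parallelize-then-combine shape covers many divide-and-conquer problems: fan the subproblems out, then join or reduce the partial results at the end.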
Re: Hybrid GPU CPU computation
I've thought about this idea, although I haven't tried it, but I think the right approach is to pick your granularity boundary and use Spark + the JVM for the large-scale parts of the algorithm, then use GPGPU APIs for number-crunching large chunks at a time. No need to run the JVM and Spark on the GPU, which would make no sense anyway. Here's another approach: http://www.cakesolutions.net/teamblogs/2013/02/13/akka-and-cuda/ dean On Fri, Apr 11, 2014 at 7:49 AM, Saurabh Jha wrote: > There is a Scala implementation for GPGPUs (NVIDIA CUDA, to be precise), but > you also need to port Mesos for GPUs. I am not sure about Mesos. Also, the > current Scala GPU version is not stable enough to be used commercially. > > Hope this helps. > > Thanks > saurabh. > > > > *Saurabh Jha* > Intl. Exchange Student > School of Computing Engineering > Nanyang Technological University, > Singapore > Web: http://profile.saurabhjha.in > Mob: +65 94663172 > > > On Fri, Apr 11, 2014 at 8:40 PM, Pascal Voitot Dev < > pascal.voitot@gmail.com> wrote: > >> This is a bit crazy :) >> I suppose you would have to run Java code on the GPU! >> I heard there are some funny projects to do that... >> >> Pascal >> >> On Fri, Apr 11, 2014 at 2:38 PM, Jaonary Rabarisoa wrote: >> >>> Hi all, >>> >>> I'm just wondering if hybrid GPU/CPU computation is something that is >>> feasible with Spark? And what should be the best way to do it. >>> >>> >>> Cheers, >>> >>> Jaonary >>> >> >> > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
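[Editor's sketch] To make the granularity-boundary idea concrete: let Spark handle distribution, and hand each whole partition to a native routine in one call. `gpuMultiply` here is a made-up stand-in for a JNI/CUDA binding you would supply yourself; its placeholder body just runs on the CPU:

```scala
// Hypothetical GPU offload at partition granularity (assumes an RDD `rdd`
// of Doubles). gpuMultiply is a placeholder for a real native GPU binding.
def gpuMultiply(chunk: Array[Double]): Array[Double] =
  chunk.map(_ * 2.0)   // CPU placeholder; a real version would invoke a CUDA kernel

val result = rdd.mapPartitions { iter =>
  val chunk = iter.toArray       // batch the whole partition: one big GPU call,
  gpuMultiply(chunk).iterator    // not one call per element
}
```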
Re: Do I need to learn Scala for spark ?
I'm doing a talk this week at the Philly ETE conference on Spark. I'll compare the Hadoop Java API and the Spark Scala API for implementing the *inverted index* algorithm. I'm going to make the case that the Spark API, like many functional-programming APIs, is so powerful that it's well worth your time to learn it. The Spark Java API certainly strives for this goal, too. Another nice thing about the Scala API, which is also true of the APIs for Scalding and Summingbird, is that you don't have to learn a lot of Scala to use them (at least for a while...). I'll post a link to the slides afterwards. Dean Wampler Typesafe, Inc. @deanwampler On Mon, Apr 21, 2014 at 8:46 AM, Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: > Hi, > > I think you can do just fine with your Java knowledge. There is a Java API > that you can use [1]. I am also new to Spark and I have gotten around with > just my Java knowledge. And Scala is easy to learn if you are good with > Java. > > [1] http://spark.apache.org/docs/latest/java-programming-guide.html > > Regards, > Pulasthi > > > On Mon, Apr 21, 2014 at 7:10 PM, arpan57 wrote: > >> Hi guys, >> I read that Spark is much faster than Hadoop, and that inspires me to learn >> it. >> I have hands-on experience with Hadoop (MR-1) and am pretty good with Java >> programming. >> Do I need to learn Scala in order to learn Spark? >> Can I go ahead and write my jobs in Java and run them on Spark? >> How much does learning Spark depend on Scala? >> >> Thanks in advance. >> >> Regards, >> Arpan >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Do-I-need-to-learn-Scala-for-spark-tp4528.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. 
>> > > > > -- > Pulasthi Supun > Undergraduate > Dpt of Computer Science & Engineering > University of Moratuwa > Blog : http://pulasthisupun.blogspot.com/ > Git hub profile: > <http://pulasthisupun.blogspot.com/>https://github.com/pulasthi > <https://github.com/pulasthi> > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
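To give a flavor of the functional style the talk argues for, here is an inverted index sketched in plain, runnable Python — an illustration of the same flatMap-then-group-by-key shape, not the talk's actual code:

```python
from collections import defaultdict

def inverted_index(docs):
    """docs: mapping of document id -> text.
    Returns: mapping of word -> sorted list of document ids."""
    # In Spark this is a flatMap emitting (word, doc_id) pairs,
    # followed by groupByKey -- two lines in the Scala or Java 8 API.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {"d1": "spark makes big data simple", "d2": "big data tools"}
index = inverted_index(docs)
print(index["big"])
```

The Hadoop MR-1 version of the same algorithm needs a Mapper class, a Reducer class, and a driver; the functional version is the whole loop body above.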
Re: K-means with large K
You might also investigate other clustering algorithms, such as canopy clustering and nearest neighbors. Some of them are less accurate, but more computationally efficient. Often they are used to compute approximate clusters followed by k-means (or a variant thereof) for greater accuracy. dean On Mon, Apr 28, 2014 at 11:41 AM, Buttler, David wrote: > One thing I have used this for was to create codebooks for SIFT features > in images. It is a common, though fairly naïve, method for converting high > dimensional features into a simple word-like features. Thus, if you have > 200 SIFT features for an image, you can reduce that to 200 ‘words’ that can > be directly compared across your entire image set. The drawback is that > usually some parts of the feature space are much more dense than other > parts and distinctive features could be lost. You can try to minimize that > by increasing K, but there are diminishing returns. If you measure the > quality of your clusters in such situations, you will find that the quality > levels off between 1000 and 4000 clusters (at least it did for my SIFT > feature set, YMMV on other data sets). > > > > Dave > > > > *From:* Chester Chen [mailto:chesterxgc...@yahoo.com] > *Sent:* Monday, April 28, 2014 9:31 AM > *To:* user@spark.apache.org > *Cc:* user@spark.apache.org > *Subject:* Re: K-means with large K > > > > David, > > Just curious to know what kind of use cases demand such large k clusters > > > > Chester > > Sent from my iPhone > > > On Apr 28, 2014, at 9:19 AM, "Buttler, David" wrote: > > Hi, > > I am trying to run the K-means code in mllib, and it works very nicely > with small K (less than 1000). However, when I try for a larger K (I am > looking for 2000-4000 clusters), it seems like the code gets part way > through (perhaps just the initialization step) and freezes. The compute > nodes stop doing any CPU / network / IO and nothing happens for hours. 
I > had done something similar back in the days of Spark 0.6, and I didn’t have > any trouble going up to 4000 clusters with similar data. > > > > This happens with both a standalone cluster, and in local multi-core mode > (with the node given 200GB of heap), but eventually completes in local > single-core mode. > > > > Data statistics: > > Rows: 166248 > > Columns: 108 > > > > This is a test run before trying it out on much larger data > > > > Any ideas on what might be the cause of this? > > > > Thanks, > > Dave > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
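For reference, canopy clustering — one of the cheap pre-clustering passes suggested above — needs only two distance thresholds, T1 > T2. A minimal single-machine sketch (illustrative only; the data and thresholds below are made up for the example):

```python
def canopy(points, t1, t2, dist):
    """Greedy canopy clustering. Each pass picks a center, gathers every
    point within t1 into its canopy, and removes points within t2 from
    further consideration. Requires t1 > t2. Canopies may overlap, which
    is fine when they are only seeds for a later k-means refinement."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        members = [p for p in remaining if dist(p, center) < t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if dist(p, center) >= t2]
    return canopies

# 1-D example with two obvious groups:
pts = [1.0, 1.2, 1.1, 10.0, 10.3]
result = canopy(pts, t1=3.0, t2=1.5, dist=lambda a, b: abs(a - b))
print([c for c, _ in result])
```

Running full k-means only within each canopy keeps the expensive distance computations local, which is one way to sidestep the large-K initialization cost described above.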
My talk on "Spark: The Next Top (Compute) Model"
I meant to post this last week, but this is a talk I gave at the Philly ETE conf. last week: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model Also here: http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf dean -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: My talk on "Spark: The Next Top (Compute) Model"
Thanks for the clarification. I'll fix the slide. I've done a lot of Scalding/Cascading programming where the two concepts are synonymous, but clearly I was imposing my prejudices here ;) dean On Thu, May 1, 2014 at 8:18 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Cool intro, thanks! One question. On slide 23 it says "Standalone ("local" > mode)". That sounds a bit confusing without hearing the talk. > > Standalone mode is not local. It just does not depend on a cluster > software. I think it's the best mode for EC2/GCE, because they provide a > distributed filesystem anyway (S3/GCS). Why configure Hadoop if you don't > have to. > > > On Thu, May 1, 2014 at 12:25 AM, Dean Wampler wrote: > >> I meant to post this last week, but this is a talk I gave at the Philly >> ETE conf. last week: >> >> http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model >> >> Also here: >> >> http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf >> >> dean >> >> -- >> Dean Wampler, Ph.D. >> Typesafe >> @deanwampler >> http://typesafe.com >> http://polyglotprogramming.com >> > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
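For readers skimming the thread, the local-vs-standalone distinction Daniel draws comes down to the `--master` URL passed to `spark-submit` (the class, jar, and host names below are placeholders):

```
# Local mode: everything runs in one JVM on your machine, 2 worker threads.
spark-submit --master local[2] --class MyApp my-app.jar

# Standalone mode: a real cluster using Spark's own built-in manager --
# no Hadoop, YARN, or Mesos required. "master-host" is a placeholder.
spark-submit --master spark://master-host:7077 --class MyApp my-app.jar
```

Standalone pairs naturally with EC2/GCE as Daniel notes, since S3/GCS can stand in for a distributed filesystem.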
Re: My talk on "Spark: The Next Top (Compute) Model"
That's great! Thanks. Let me know if it works ;) or what I could improve to make it work. dean On Thu, May 1, 2014 at 8:45 AM, ZhangYi wrote: > Very Useful material. Currently, I am trying to persuade my client choose > Spark instead of Hadoop MapReduce. Your slide give me more evidence to > support my opinion. > > -- > ZhangYi (张逸) > Developer > tel: 15023157626 > blog: agiledon.github.com > weibo: tw张逸 > Sent with Sparrow <http://www.sparrowmailapp.com/?sig> > > On Thursday, May 1, 2014 at 9:18 PM, Daniel Darabos wrote: > > Cool intro, thanks! One question. On slide 23 it says "Standalone ("local" > mode)". That sounds a bit confusing without hearing the talk. > > Standalone mode is not local. It just does not depend on a cluster > software. I think it's the best mode for EC2/GCE, because they provide a > distributed filesystem anyway (S3/GCS). Why configure Hadoop if you don't > have to. > > > On Thu, May 1, 2014 at 12:25 AM, Dean Wampler wrote: > > I meant to post this last week, but this is a talk I gave at the Philly > ETE conf. last week: > > http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model > > Also here: > > http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf > > dean > > -- > Dean Wampler, Ph.D. > Typesafe > @deanwampler > http://typesafe.com > http://polyglotprogramming.com > > > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: My talk on "Spark: The Next Top (Compute) Model"
I updated the uploads at both locations to fix slide 23. Thanks for the feedback. dean On Thu, May 1, 2014 at 9:25 AM, diplomatic Guru wrote: > Thanks Dean, very useful indeed! > > Best regards, > > Raj > > > On 1 May 2014 14:46, Dean Wampler wrote: > >> That's great! Thanks. Let me know if it works ;) or what I could improve >> to make it work. >> >> dean >> >> >> On Thu, May 1, 2014 at 8:45 AM, ZhangYi wrote: >> >>> Very Useful material. Currently, I am trying to persuade my client >>> choose Spark instead of Hadoop MapReduce. Your slide give me more evidence >>> to support my opinion. >>> >>> -- >>> ZhangYi (张逸) >>> Developer >>> tel: 15023157626 >>> blog: agiledon.github.com >>> weibo: tw张逸 >>> Sent with Sparrow <http://www.sparrowmailapp.com/?sig> >>> >>> On Thursday, May 1, 2014 at 9:18 PM, Daniel Darabos wrote: >>> >>> Cool intro, thanks! One question. On slide 23 it says "Standalone >>> ("local" mode)". That sounds a bit confusing without hearing the talk. >>> >>> Standalone mode is not local. It just does not depend on a cluster >>> software. I think it's the best mode for EC2/GCE, because they provide a >>> distributed filesystem anyway (S3/GCS). Why configure Hadoop if you don't >>> have to. >>> >>> >>> On Thu, May 1, 2014 at 12:25 AM, Dean Wampler wrote: >>> >>> I meant to post this last week, but this is a talk I gave at the Philly >>> ETE conf. last week: >>> >>> http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model >>> >>> Also here: >>> >>> http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf >>> >>> dean >>> >>> -- >>> Dean Wampler, Ph.D. >>> Typesafe >>> @deanwampler >>> http://typesafe.com >>> http://polyglotprogramming.com >>> >>> >>> >>> >> >> >> -- >> Dean Wampler, Ph.D. >> Typesafe >> @deanwampler >> http://typesafe.com >> http://polyglotprogramming.com >> > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Spark Training
I'm working on a 1-day workshop that I'm giving in Australia next week and a few other conferences later in the year. I'll post a link when it's ready. dean On Thu, May 1, 2014 at 10:30 AM, Denny Lee wrote: > You may also want to check out Paco Nathan's Introduction to Spark > courses: http://liber118.com/pxn/ > > > > On May 1, 2014, at 8:20 AM, Mayur Rustagi wrote: > > Hi Nicholas, > We provide training on spark, hands-on also associated ecosystem. > We gave it recently at a conference in Santa Clara. Primarily its > targetted to novices in Spark ecosystem, to introduce them & hands on to > get them to write simple codes & also queries on Shark. > I think Cloudera also has one which is for Spark, Streaming & MLLib. > > Regards > Mayur > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Thu, May 1, 2014 at 8:42 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> There are many freely-available resources for the enterprising individual >> to use if they want to Spark up their life. >> >> For others, some structured training is in order. Say I want everyone >> from my department at my company to get something like the AMP >> Camp<http://ampcamp.berkeley.edu/>experience, perhaps on-site. >> >> What are my options for that? >> >> Databricks doesn't have a contact page, so I figured this would be the >> next best place to ask. >> >> Nick >> >> >> -- >> View this message in context: Spark >> Training<http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Training-tp5166.html> >> Sent from the Apache Spark User List mailing list >> archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at >> Nabble.com. >> > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Spark and Java 8
Cloudera customers will need to put pressure on them to support Java 8. They only officially supported Java 7 when Oracle stopped supporting Java 6. dean On Wed, May 7, 2014 at 5:05 AM, Matei Zaharia wrote: > Java 8 support is a feature in Spark, but vendors need to decide for > themselves when they’d like to support Java 8 commercially. You can still run > Spark on Java 7 or 6 without taking advantage of the new features (indeed > our builds are always against Java 6). > > Matei > > On May 6, 2014, at 8:59 AM, Ian O'Connell wrote: > > I think the distinction there might be that they never said they ran that code > under CDH5, just that Spark supports it and Spark runs under CDH5. Not that > you can use these features while running under CDH5. > > They could use Mesos or the standalone scheduler to run them. > > > On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren wrote: > >> Hi >> >> I just read an article [1] about Spark, CDH5 and Java 8 but did not get >> exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark >> using a separate JVM that runs on data nodes, or is it reusing the YARN JVM >> runtime somehow, like hadoop1? >> >> CDH5 only supports Java 7 [2] as far as I know? >> >> Cheers, >> -Kristoffer >> >> >> [1] >> http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/ >> [2] >> http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/CDH5-Requirements-and-Supported-Versions.html >> >> >> >> >> > > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com
Re: Announcing Spark 1.0.0
Congratulations!! On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell wrote: > I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 > is a milestone release as the first in the 1.0 line of releases, > providing API stability for Spark's core interfaces. > > Spark 1.0.0 is Spark's largest release ever, with contributions from > 117 developers. I'd like to thank everyone involved in this release - > it was truly a community effort with fixes, features, and > optimizations contributed from dozens of organizations. > > This release expands Spark's standard libraries, introducing a new SQL > package (SparkSQL) which lets users integrate SQL queries into > existing Spark workflows. MLlib, Spark's machine learning library, is > expanded with sparse vector support and several new algorithms. The > GraphX and Streaming libraries also introduce new features and > optimizations. Spark's core engine adds support for secured YARN > clusters, a unified tool for submitting Spark applications, and > several performance and stability improvements. Finally, Spark adds > support for Java 8 lambda syntax and improves coverage of the Java and > Python API's. > > Those features only scratch the surface - check out the release notes here: > http://spark.apache.org/releases/spark-release-1-0-0.html > > Note that since release artifacts were posted recently, certain > mirrors may not have working downloads for a few hours. > > - Patrick > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com