Re: s3 vfs on Mesos Slaves
Might I ask why VFS? I'm new to VFS and not sure whether or not it predates the Hadoop file system interfaces (HCFS). After all, Spark natively supports any HCFS by leveraging the Hadoop FileSystem API, class loaders, and so on. So simply putting those resources on your classpath should be sufficient to connect directly to S3, using the sc.hadoopFile(...) commands (a rough sketch follows at the end of this thread).

On May 13, 2015 12:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Did you happen to have a look at this? https://github.com/abashev/vfs-s3

Thanks
Best Regards

On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com wrote:

We have a small Mesos cluster, and these slaves need to have a VFS set up on them so that the slaves can pull down the data they need from S3 when Spark runs. There doesn't seem to be any obvious way online on how to do this or how to easily accomplish it. Does anyone have some best practices or ideas about how to accomplish this? An example stack trace when a job is run on the Mesos cluster is below. Any idea how to get this going? Like somehow bootstrapping Spark on run or something?

Thanks,
Steve

java.io.IOException: Unsupported scheme s3n for URI s3n://removed
    at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
    at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
    at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
    at com.coldlight.neuron.data.ClquetPartitionedData$Iter.init(ClquetPartitionedData.java:330)
    at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID 1)
java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
    at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
    at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
    at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
    at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
    at com.coldlight.neuron.data.ClquetPartitionedData$Iter.init(ClquetPartitionedData.java:330)
    at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
    ...
    8 more
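To illustrate the first reply above, here is a minimal sketch (not from the thread) of reading S3 data directly through Spark's Hadoop input paths rather than a VFS. It assumes the s3n filesystem support jars are on the executor classpath; the bucket, path, and credential values are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: read from S3 through the Hadoop FileSystem layer instead of a VFS.
    val sc = new SparkContext(new SparkConf().setAppName("s3-read-sketch"))

    // s3n credentials go into the Hadoop configuration Spark hands to the input format.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    // Any Hadoop-compatible scheme works once the matching filesystem jar is on the classpath.
    val lines = sc.textFile("s3n://some-bucket/some/path/*.csv")
    println(lines.count())

textFile here goes through the same Hadoop input-format machinery as sc.hadoopFile, so either entry point works once the scheme is supported.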
Re: Keep or remove Debian packaging in Spark?
and in recent conversations I didn't hear dissent to the idea of removing this. Is this still useful enough to fix up? All else being equal, I'd like to start to walk back some of the complexity of the build, but I don't know how all-else-equal it is. Certainly, it sounds like nobody intends these to be used to actually deploy Spark. I don't doubt it's useful to someone, but can they maintain the packaging logic elsewhere?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

-- jay vyas
spark akka fork : is the source anywhere?
Hi spark. Where is akka coming from in spark? I see the distribution referenced is a spark artifact... but not in the apache namespace:

    <akka.group>org.spark-project.akka</akka.group>
    <akka.version>2.3.4-spark</akka.version>

Clearly this is a deliberate, thought-out change (see SPARK-1812), but it's not clear where 2.3.4-spark is coming from and who is maintaining its release.

-- jay vyas

PS I've had some conversations with Will Benton as well about this, and it's clear that some modifications to akka are needed, or else a protobuf error occurs, which amounts to serialization incompatibilities. Hence, if one wants to build spark from sources, the patched akka is required (or else manual patching needs to be done)...

15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down ActorSystem [sparkWorker]
java.lang.VerifyError: class akka.remote.WireFormats$AkkaControlMessage overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
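For context, a hedged sketch of how a build could depend on the Spark-published akka fork explicitly. The group and version below mirror the properties quoted above; the module names (akka-actor, akka-remote) are assumptions based on upstream akka's layout, not confirmed by the original message.

    // build.sbt sketch -- module names are assumed to mirror upstream akka's
    libraryDependencies ++= Seq(
      "org.spark-project.akka" %% "akka-actor"  % "2.3.4-spark",
      "org.spark-project.akka" %% "akka-remote" % "2.3.4-spark"
    )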
Re: Standardized Spark dev environment
I can comment on both... hi Will and Nate :)

1) Will's Dockerfile solution is the most simple, direct solution to the dev environment question: it's an efficient way to build and develop Spark environments for dev/test. It would be cool to put that Dockerfile (and/or maybe a shell script which uses it) in the top level of Spark as the build entry point. For total platform portability, you could wrap it in a Vagrantfile to launch a lightweight VM, so that Windows worked equally well.

2) However, since Nate mentioned Vagrant and Bigtop, I have to chime in :) The Vagrant recipes in Bigtop are a nice reference deployment of how to deploy Spark in a heterogeneous Hadoop-style environment, and tighter integration testing with Bigtop for Spark releases would be lovely! The Vagrant stuff uses Puppet to deploy an n-node VM or Docker-based cluster, in which users can easily select components (including spark, yarn, hbase, hadoop, etc...) by simply editing a YAML file: https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml

As Nate said, it would be a lot of fun to get more cross-collaboration between the Spark and Bigtop communities. Input on how we can better integrate Spark (whether it's Spork, HBase integration, smoke tests around the MLlib stuff, or whatever) is always welcome.

On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

How many profiles (hadoop / hive / scala) would this development environment support? As many as we want. We probably want to cover a good chunk of the build matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark officially supports.

What does this provide, concretely? It provides a reliable way to create a "good" Spark development environment. Roughly speaking, this probably should mean an environment that matches Jenkins, since that's where we run "official" testing and builds. For example, Spark has to run on Java 6 and Python 2.6. When devs build and run Spark locally, we can make sure they're doing it on these versions of the languages with a simple vagrant up.

Nate, could you comment on how something like this would relate to the Bigtop effort? http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

Will, that's pretty sweet. I tried something similar a few months ago as an experiment to try building/testing Spark within a container. Here's the shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa against the base CentOS Docker image to set up an environment ready to build and test Spark.

We want to run Spark unit tests within containers on Jenkins, so it might make sense to develop a single Docker image that can be used as both a "dev environment" as well as an execution container on Jenkins. Perhaps that's the approach to take instead of looking into Vagrant.

Nick

On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote:

Hey Nick, I did something similar with a Docker image last summer; I haven't updated the images to cache the dependencies for the current Spark master, but it would be trivial to do so: http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

best,
wb

----- Original Message -----
From: Nicholas Chammas nicholas.cham...@gmail.com
To: Spark dev list dev@spark.apache.org
Sent: Tuesday, January 20, 2015 6:13:31 PM
Subject: Standardized Spark dev environment

What do y'all think of creating a standardized Spark development environment, perhaps encoded as a Vagrantfile, and publishing it under `dev/`?
The goal would be to make it easier for new developers to get started with all the right configs and tools pre-installed. If we use something like Vagrant, we may even be able to make it so that a single Vagrantfile creates equivalent development environments across OS X, Linux, and Windows, without having to do much (or any) OS-specific work.

I imagine for committers and regular contributors, this exercise may seem pointless, since y'all are probably already very comfortable with your workflow. I wonder, though, if any of you think this would be worthwhile as an improvement to the new Spark developer experience.

Nick

-- jay vyas
Re: EndpointWriter : Dropping message failure ReliableDeliverySupervisor errors...
Hi folks. In the end, I found that the problem was that I was using IP addresses instead of hostnames. I guess, maybe, reverse DNS is a requirement for Spark slave-to-master communications...?

On Fri, Dec 19, 2014 at 7:21 PM, jay vyas jayunit100.apa...@gmail.com wrote:

Hi spark. I'm trying to understand the akka debug messages when networking doesn't work properly. Any hints would be great on this.

SIMPLE TESTS I RAN
- I tried a ping; works.
- I tried a telnet to the 7077 port of the master, from the slave; also works.

LOGS
1) On the master I see this WARN log buried: ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@s2.docker:45477] has failed, address is now gated for [500] ms. Reason is: [Disassociated].
2) I also see a periodic, repeated ERROR message: ERROR EndpointWriter: dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://sparkMaster@172.17.0.12:7077

Any idea what these mean, folks? From what I can tell, I can telnet from s2.docker to my master server. Any thoughts for more debugging of this would be appreciated! I'm out of ideas for the time being.

-- jay vyas

-- jay vyas
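A hedged illustration of the fix described above -- connecting by resolvable hostnames rather than raw IPs. The hostnames (spark-master.docker, driver.docker) are placeholders, not from the original thread, and spark.driver.host is shown only as one example of a place where a hostname can be advertised.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: point at the master by hostname, not IP, and advertise a
    // hostname for the driver that the workers can resolve.
    val conf = new SparkConf()
      .setAppName("hostname-instead-of-ip")
      .setMaster("spark://spark-master.docker:7077") // hostname, not 172.17.0.x
      .set("spark.driver.host", "driver.docker")     // placeholder resolvable hostname

    val sc = new SparkContext(conf)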
EndpointWriter : Dropping message failure ReliableDeliverySupervisor errors...
Hi spark. I'm trying to understand the akka debug messages when networking doesn't work properly. Any hints would be great on this.

SIMPLE TESTS I RAN
- I tried a ping; works.
- I tried a telnet to the 7077 port of the master, from the slave; also works.

LOGS
1) On the master I see this WARN log buried: ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@s2.docker:45477] has failed, address is now gated for [500] ms. Reason is: [Disassociated].
2) I also see a periodic, repeated ERROR message: ERROR EndpointWriter: dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://sparkMaster@172.17.0.12:7077

Any idea what these mean, folks? From what I can tell, I can telnet from s2.docker to my master server. Any thoughts for more debugging of this would be appreciated! I'm out of ideas for the time being.

-- jay vyas
Is there a way for scala compiler to catch unserializable app code?
This is more a curiosity than an immediate problem. Here is my question: I ran into this easily solved issue http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou recently. The solution was to replace my class with a Scala singleton, which I guess is readily serializable. So it's clear that Spark needs to serialize the objects which carry the driver methods for an app in order to run... but I'm wondering, maybe there is a way to change or update the Spark API to catch unserializable Spark apps at compile time?

-- jay vyas
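To make the failure mode concrete, here is a minimal, hypothetical sketch (not from the original post) of the class-vs-singleton difference behind the linked Task not serializable error. The names Doubler and run are made up for illustration.

    import org.apache.spark.SparkContext

    // Non-serializable helper: calling its method inside a closure captures the
    // instance, and since Doubler is not Serializable the task submission fails
    // with java.io.NotSerializableException at runtime, not compile time.
    class Doubler {
      def double(x: Int): Int = x * 2
    }

    // Singleton version: the object is reached through its static module
    // reference, so nothing extra has to be serialized with the closure.
    object Doubler {
      def double(x: Int): Int = x * 2
    }

    def run(sc: SparkContext): Unit = {
      val rdd = sc.parallelize(1 to 10)

      val helper = new Doubler
      // rdd.map(helper.double).collect()  // throws: Task not serializable

      rdd.map(Doubler.double).collect()    // fine: no instance to capture
    }

Since the capture only surfaces when the closure is shipped to executors, the compiler has no serializability information to check, which is exactly why this shows up at runtime rather than compile time.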
Re: best IDE for scala + spark development?
I tried the Scala Eclipse IDE, but on Scala 2.10 I ran into some weird issues http://stackoverflow.com/questions/24253084/scalaide-and-cryptic-classnotfound-errors ... So I switched to IntelliJ and was much more satisfied. I've written a post on how I use Fedora, sbt, and IntelliJ for Spark apps: http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html?m=1

The IntelliJ sbt plugin is, IMO, less buggy than the Eclipse ScalaIDE stuff. For example, I found I had to set some special preferences. Finally... given sbt's automated recompile option, if you just use tmux and vim with NERDTree, with sbt, you could come pretty close to something like an IDE without all the drama.

On Oct 26, 2014, at 11:07 AM, ll duy.huynh@gmail.com wrote:

I'm new to both Scala and Spark. What IDE / dev environment do you find most productive for writing code in Scala with Spark? Is it just vim + sbt? Or does a full IDE like IntelliJ work out better? Thanks!

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/best-IDE-for-scala-spark-development-tp8965.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
OutOfMemoryError when running sbt/sbt test
Hi spark. I've been trying to build Spark, but I've been getting lots of OOME exceptions: https://gist.github.com/jayunit100/d424b6b825ce8517d68c

For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread

I've attempted to hard-code the get_mem_opts function, which is in the sbt-launch-lib.bash file, to have various very high parameter sizes (i.e. -Xms5g) with a high MaxPermSize, etc... to no avail. Any thoughts on this would be appreciated. I know of others having the same problem as well. Thanks!

-- jay vyas
Re: OutOfMemoryError when running sbt/sbt test
Thanks...! Some questions below.

1) You are suggesting that maybe this OOME is a symptom/red herring, and the true cause is that a thread can't spawn because of ulimit... If so, possibly this could be flagged early on in the build. And -- where are so many threads coming from that I need to up my limit? Is this a new feature added to Spark recently, and if so, will it affect deployment scenarios as well?

2) And possibly SBT_OPTS is where the memory settings should be? If so, then why do we have the get_mem_opts wrapper function coded to send memory manually as Xmx/Xms options?

    execRunner $java_cmd \
      ${SBT_OPTS:-$default_sbt_opts} \
      $(get_mem_opts $sbt_mem) \
      ${java_opts} \
      ${java_args[@]} \
      -jar $sbt_jar \
      ${sbt_commands[@]} \
      ${residual_args[@]}

On Aug 26, 2014, at 8:58 PM, Mubarak Seyed spark.devu...@gmail.com wrote:

What is your ulimit value?

On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com wrote:

Hi spark. I've been trying to build Spark, but I've been getting lots of OOME exceptions: https://gist.github.com/jayunit100/d424b6b825ce8517d68c

For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread

I've attempted to hard-code the get_mem_opts function, which is in the sbt-launch-lib.bash file, to have various very high parameter sizes (i.e. -Xms5g) with a high MaxPermSize, etc... to no avail. Any thoughts on this would be appreciated. I know of others having the same problem as well. Thanks!

-- jay vyas