Re: Java 8 vs Scala
We have a complex application that has been running in production for a couple of months and heavily uses Spark in Scala. Just to give you some insight into the complexity: the source data is not that huge (only about 500'000 complex elements), but our machine learning algorithms produce more than a billion transformations and intermediate data elements. Our current Spark/Mesos cluster consists of 120 CPUs, 190 GB RAM and plenty of HDD space.

Now regarding your question:
- Scala is just a beautiful language in itself; that has nothing to do with Spark;
- the Spark API fits very naturally into Scala semantics, because map/reduce transformations are written more or less identically for local collections and RDDs;
- as with any religious topic there is controversial discussion about which language is better, and most of the time (I have read quite a lot of blog/forum threads on this) the argumentation is based on which religion one belongs to (e.g. Java vs Scala vs Python);
- we have checked the supposed performance issues and limitations of Scala described here (http://www.infoq.com/news/2011/11/yammer-scala) by refactoring to the "best practices" described in the article, and observed a performance increase in some places and, at the same time, a performance decrease in other places. Thus I would say there is no noticeable performance difference between Scala and Java in our use case (of course there are, and always will be, applications where one or the other language performs better).

hope I could help
reinis

On 15.07.2015 09:27, 诺铁 wrote:
I think different teams get different answers to this question. My team uses Scala and is happy with it.

On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers <tris...@blackfrog.org> wrote:
We have had excellent results operating on RDDs using Java 8 with Lambdas. It's slightly more verbose than Scala, but I haven't found this an issue, and haven't missed any functionality. The new DataFrame API makes the Spark platform even more language agnostic.
Tristan

On 15 July 2015 at 06:40, Vineel Yalamarthy <vineelyalamar...@gmail.com> wrote:
Good question. Like you, many are in the same boat (coming from a Java background). Looking forward to responses from the community.
Regards
Vineel

On Tue, Jul 14, 2015 at 2:30 PM, spark user <spark_u...@yahoo.com.invalid> wrote:
Hi All,
To start a new project in Spark, which technology is good: Java 8 or Scala? I am a Java developer. Can I start with Java 8, or do I need to learn Scala? Which one is the better technology for a quick start on any POC project?
Thanks
- su

--
Thanks and Regards,
Venkata Vineel, Student, School of Computing
Mobile: +1-385-2109-788
"Innovation is the ability to convert ideas into invoices"
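To illustrate the point about map/reduce transformations reading the same for local collections and RDDs, here is a minimal sketch (not taken from the mail above; sc is the SparkContext that spark-shell provides):

    // The same pipeline, once on a plain Scala collection and once on an RDD.
    // Only the source of the data changes; the transformations read identically.
    val localSum = (1 to 1000).map(x => x * x).filter(_ % 2 == 0).reduce(_ + _)
    val rddSum   = sc.parallelize(1 to 1000).map(x => x * x).filter(_ % 2 == 0).reduce(_ + _)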
Re: Tasks randomly stall when running on mesos
Hi,

I just configured my cluster to run with 1.4.0-rc2; alas, the dependency jungle does not let one just download, configure and start. Instead one will have to fiddle with sbt settings for the upcoming couple of nights:

2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://driverPropsFetcher@app03:44805] has failed, address is now gated for [5000] ms. Reason is: [org.apache.spark.rpc.akka.AkkaMessage].
2015-05-26 14:52:55,707 ERROR Remoting - org.apache.spark.rpc.akka.AkkaMessage
java.lang.ClassNotFoundException: org.apache.spark.rpc.akka.AkkaMessage
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:626)
at akka.util.ClassLoaderObjectInputStream.resolveClass(ClassLoaderObjectInputStream.scala:19)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

kind regards
reinis

On 25.05.2015 23:09, Reinis Vicups wrote:
Great hints, you guys! Yes, spark-shell worked fine with mesos as master. I haven't tried to execute multiple RDD actions in a row though (I did a couple of successful counts on HBase tables I am working with in several experiments, but nothing that would compare to the stuff my spark jobs are doing), but will check if the shell stalls upon some decent RDD action. Also thanks a bunch for the links to the binaries. This will literally save me hours!

kind regards
reinis

On 25.05.2015 21:00, Dean Wampler wrote:
Here is a link for builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
For a mvn repo, I believe the RC2 artifacts are here: https://repository.apache.org/content/repositories/orgapachespark-1104/

A few experiments you might try:
1. Does spark-shell work? It might start fine, but make sure you can create an RDD and use it, e.g., something like:
val rdd = sc.parallelize(Seq(1,2,3,4,5,6))
rdd foreach println
2. Try coarse grained mode, which has different logic for executor management. You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:
spark.mesos.coarse true
Or, from this page <http://spark.apache.org/docs/latest/running-on-mesos.html>, set the property in a SparkConf object used to construct the SparkContext:
conf.set("spark.mesos.coarse", "true")

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <ht
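A hedged build.sbt sketch of how the RC2 staging repository mentioned above could be wired into an sbt project; the resolver URL is the one linked in the mail, while the artifact coordinates and the plain "1.4.0" version string are assumptions, not taken from this thread:

    // Hypothetical build.sbt fragment: pull the RC artifacts from the Apache staging repo.
    resolvers += "Apache Spark 1.4.0 RC2 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1104/"

    // RC builds are assumed to be published under the final version number.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"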
Re: Tasks randomly stall when running on mesos
Great hints, you guys! Yes, spark-shell worked fine with mesos as master. I haven't tried to execute multiple RDD actions in a row though (I did a couple of successful counts on HBase tables I am working with in several experiments, but nothing that would compare to the stuff my spark jobs are doing), but will check if the shell stalls upon some decent RDD action. Also thanks a bunch for the links to the binaries. This will literally save me hours!

kind regards
reinis

On 25.05.2015 21:00, Dean Wampler wrote:
Here is a link for builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
For a mvn repo, I believe the RC2 artifacts are here: https://repository.apache.org/content/repositories/orgapachespark-1104/

A few experiments you might try:
1. Does spark-shell work? It might start fine, but make sure you can create an RDD and use it, e.g., something like:
val rdd = sc.parallelize(Seq(1,2,3,4,5,6))
rdd foreach println
2. Try coarse grained mode, which has different logic for executor management. You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:
spark.mesos.coarse true
Or, from this page <http://spark.apache.org/docs/latest/running-on-mesos.html>, set the property in a SparkConf object used to construct the SparkContext:
conf.set("spark.mesos.coarse", "true")

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
Hello,
I assume I am running spark in fine-grained mode since I haven't changed the default here. One question regarding 1.4.0-RC1 - is there a mvn snapshot repository I could use for my project config? (I know that I have to download the source and make-distribution for the executor as well)
thanks
reinis

On 25.05.2015 17:07, Iulian Dragoș wrote:
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
Hello,
I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper and running on a cluster with 3 nodes on 64bit ubuntu. My application is compiled with spark 1.3.1 (apparently with a mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination have I succeeded to run spark-jobs on mesos at all. Different versions are causing class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos.
Are you using coarse grained or fine grained mode?
sandbox log of slave-node app01 (the one that stalls) shows following: 10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' 10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client 10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' 10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05' Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0 *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null* 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus It looks like an inconsistent state on the Mesos scheduler. It tries to launch a task on a given slave before the
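A minimal spark-shell sanity check along the lines Reinis describes above (not from the thread; the numbers and partition count are arbitrary). If any of the repeated actions stalls here as well, the problem is reproducible outside the application:

    // Run several RDD actions in a row against the Mesos cluster and watch for stalls.
    val rdd = sc.parallelize(1 to 1000000, 8)
    (1 to 5).foreach { i =>
      val n = rdd.map(_ * 2).filter(_ % 3 == 0).count()
      println(s"run $i -> $n")
    }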
Re: Tasks randomly stall when running on mesos
Hello, I assume I am running spark in a fine-grained mode since I haven't changed the default here. One question regarding 1.4.0-RC1 - is there a mvn snapshot repository I could use for my project config? (I know that I have to download source and make-distribution for executor as well) thanks reinis On 25.05.2015 17:07, Iulian Dragoș wrote: On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups <mailto:sp...@orbit-x.de>> wrote: Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper and running on a cluster with 3 nodes on 64bit ubuntu. My application is compiled with spark 1.3.1 (apparently with mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination I have succeeded to run spark-jobs on mesos at all. Different versions are causing class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos. Are you using coarse grained or fine grained mode? sandbox log of slave-node app01 (the one that stalls) shows following: 10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' 10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client 10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' 10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05' Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0 *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null* 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus It looks like an inconsistent state on the Mesos scheduler. It tries to launch a task on a given slave before the executor has registered. This code was improved/refactored in 1.4, could you try 1.4.0-RC1? iulian 10:01:34 INFO SecurityManager: Changing view acls to... 10:01:35 INFO Slf4jLogger: Slf4jLogger started 10:01:35 INFO Remoting: Starting remoting 10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx] 10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx. 
10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker 10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster 10:01:36 INFO DiskBlockManager: Created local directory at /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252 10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB 10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator 10:01:36 INFO Executor: Starting executor ID 20150511-150924-3410235146-5050-1903-S3 on host app01 10:01:36 INFO NettyBlockTransferService: Server created on XXX 10:01:36 INFO BlockManagerMaster: Trying to register BlockManager 10:01:36 INFO BlockManagerMaster: Registered BlockManager 10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver As soon as spark-driver is aborted, following log entries are added to the sandbox log of slave-node app01: 10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown 10:17:29 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. R
Tasks randomly stall when running on mesos
Hello,

I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper and running on a cluster with 3 nodes on 64bit ubuntu. My application is compiled with spark 1.3.1 (apparently with a mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination have I succeeded to run spark-jobs on mesos at all; different versions are causing class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos.

About 50% of all jobs stall forever (or until I kill the spark driver). The error occurs randomly on different slave-nodes. It happens that 4 spark-jobs in a row run completely without problems and then the problem suddenly occurs. I am always testing the same set of 5 different jobs, single and combined, and the error always occurs in different job/node/stage/task combinations. Whenever a slave-node stalls, this message appears in the sandbox-log of the failing slave:

10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null

Any hints on how to address this issue are greatly appreciated.

kind regards
reinis

The job that stalls shows the following in the spark-driver log (as one can see, task 1.0 is never finished):

10:01:25,620 INFO o.a.s.s.DAGScheduler - Submitting 4 missing tasks from Stage 0 (MapPartitionsRDD[1] at groupBy at ImportExtensionFieldsSparkJob.scala:57)
10:01:25,621 INFO o.a.s.s.TaskSchedulerImpl - Adding task set 0.0 with 4 tasks
10:01:25,656 INFO o.a.s.s.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, app03, PROCESS_LOCAL, 1140 bytes)
10:01:25,660 INFO o.a.s.s.TaskSetManager - Starting task 1.0 in stage 0.0 (TID 1, app01, PROCESS_LOCAL, 1140 bytes)
10:01:25,661 INFO o.a.s.s.TaskSetManager - Starting task 2.0 in stage 0.0 (TID 2, app02, PROCESS_LOCAL, 1140 bytes)
10:01:25,662 INFO o.a.s.s.TaskSetManager - Starting task 3.0 in stage 0.0 (TID 3, app03, PROCESS_LOCAL, 1140 bytes)
10:01:36,842 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app02 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S1, app02, 59622)
10:01:36,862 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app03 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S2, app03, 39420)
10:01:36,917 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app01 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S3, app01, 45605)
10:01:38,701 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app03 (size: 2.6 KB, free: 88.3 MB)
10:01:38,702 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app02 (size: 2.6 KB, free: 88.3 MB)
10:01:41,400 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 0.0 (TID 0) in 15721 ms on app03 (1/4)
10:01:41,539 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 0.0 (TID 2) in 15870 ms on app02 (2/4)
10:01:41,697 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 0.0 (TID 3) in 16029 ms on app03 (3/4)

The sandbox log of slave-node app01 (the one that stalls) shows the following:

10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
*10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
10:01:34 INFO SecurityManager: Changing view acls to...
10:01:35 INFO Slf4jLogger: Slf4jLogger started
10:01:35 INFO Remoting: Starting remoting
10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
10:01:35 INFO Utils:
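For completeness, a hedged sketch of the driver-side configuration the description above implies. The ZooKeeper quorum hosts and the app name are placeholders (not from this thread); the executor URI is the tgz that the sandbox log shows being fetched, set via the standard spark.executor.uri property for Mesos:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: host names below are placeholders, not taken from this thread.
    val conf = new SparkConf()
      .setAppName("ImportExtensionFields")
      .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")
      .set("spark.executor.uri", "hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz")
    val sc = new SparkContext(conf)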
Spark 1.1.0: weird spark-shell behavior
Hello,

I have two weird effects when working with spark-shell:

1. The following code, executed in spark-shell, causes the exception below. At the same time it works perfectly when submitted with spark-submit!

import org.apache.hadoop.hbase.{HConstants, HBaseConfiguration}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.mahout.math.VectorWritable
import com.google.common.io.ByteStreams
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext.rddToPairRDDFunctions
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val hConf = HBaseConfiguration.create()
hConf.set("hbase.defaults.for.version.skip", "true")
hConf.set("hbase.defaults.for.version", "0.98.6-cdh5.2.0")
hConf.set(HConstants.ZOOKEEPER_QUORUM, "myserv")
hConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
hConf.set(TableInputFormat.INPUT_TABLE, "MyNS:MyTable")

val rdd = sc.newAPIHadoopRDD(hConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count()

--- Exception ---
14/12/01 10:45:24 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:113)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (null), this version is 0.98.6-cdh5.2.0
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:73)
at org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:105)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:116)
at org.apache.hadoop.hbase.client.HConnectionManager.(HConnectionManager.java:222)
... 14 more

We have already checked most of the trivial stuff with class paths and the existence of tables and column groups, enabled HBase-specific settings to avoid the version checking, and so on. It appears that the supplied HBase configuration is completely ignored by the context. We tried to solve this issue by instantiating our own SparkContext and encountered the second weird effect:

2. When attempting to instantiate our own SparkContext, we get the exception below:

// imports block ...
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)

--- Exception ---
2014-12-01 10:42:24,966 WARN o.e.j.u.c.AbstractLifeCycle - FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Die Adresse wird bereits verwendet ("Address already in use")
java.net.BindException: Die Adresse wird bereits verwendet
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:199)
at org.apache.spark.ui.JettyUtils$$anonfun$4.apply(JettyUtils.scala:209)
at org.apache.spark.ui.JettyUtils$$anonfun$4.apply(JettyUtils.scala:209)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1449)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
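Regarding the second effect: spark-shell already creates a SparkContext (the predefined sc) whose web UI holds port 4040, so constructing a second context collides with it. A hedged sketch of the usual workaround (not taken from the thread) is to stop the shell's context before building your own:

    // In spark-shell the predefined sc owns port 4040; stop it before creating a custom context.
    sc.stop()
    val conf = new SparkConf().setAppName("Simple Application")
    val mySc = new SparkContext(conf)  // use this context for the HBase RDD instead of the shell's sc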
Re: HBase 0.96+ with Spark 1.0+
I am humbly bumping this since even after another week of trying I haven't had luck to fix this yet. On 14.09.2014 19:21, Reinis Vicups wrote: I did actually try Seans suggestion just before I posted for the first time in this thread. I got an error when doing this and thought that I am not understanding what Sean was suggesting. Now I re-attempted your suggestions with spark 1.0.0-cdh5.1.0, hbase 0.98.1-cdh5.1.0 and hadoop 2.3.0-cdh5.1.0 I am currently using. I used following: val mortbayEnforce = "org.mortbay.jetty" % "servlet-api" % "3.0.20100224" val mortbayExclusion = ExclusionRule(organization = "org.mortbay.jetty", name = "servlet-api-2.5") and applied this to hadoop and hbase dependencies e.g. like this: val hbase = Seq(HBase.server, HBase.common, HBase.compat, HBase.compat2, HBase.protocol, mortbayEnforce).map(_.excludeAll(HBase.exclusions: _*)) private object HBase { val server = "org.apache.hbase" % "hbase-server" % Version.HBase ... val exclusions = Seq(ExclusionRule("org.apache.ant"), mortbayExclusion) } I still get the error I got the last time I tried this experiment: 14/09/14 18:28:09 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136) at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130) at org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:84) at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230) at org.apache.spark.SparkContext.(SparkContext.scala:202) at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply$mcV$sp(SimpleTicketTextSimilaritySparkJobSpec.scala:29) at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21) at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683) at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644) at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656) at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656) at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683) at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714) at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1714) at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1683) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FlatSpec.org$scalatest$FlatSpecLike$$s
Re: HBase 0.96+ with Spark 1.0+
I did actually try Sean's suggestion just before I posted for the first time in this thread. I got an error when doing this and thought that I was not understanding what Sean was suggesting.

Now I have re-attempted your suggestions with the spark 1.0.0-cdh5.1.0, hbase 0.98.1-cdh5.1.0 and hadoop 2.3.0-cdh5.1.0 I am currently using. I used the following:

val mortbayEnforce = "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
val mortbayExclusion = ExclusionRule(organization = "org.mortbay.jetty", name = "servlet-api-2.5")

and applied this to the hadoop and hbase dependencies, e.g. like this:

val hbase = Seq(HBase.server, HBase.common, HBase.compat, HBase.compat2, HBase.protocol, mortbayEnforce).map(_.excludeAll(HBase.exclusions: _*))

private object HBase {
  val server = "org.apache.hbase" % "hbase-server" % Version.HBase
  ...
  val exclusions = Seq(ExclusionRule("org.apache.ant"), mortbayExclusion)
}

I still get the error I got the last time I tried this experiment:

14/09/14 18:28:09 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
at org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:84)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
at org.apache.spark.SparkContext.(SparkContext.scala:202)
at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply$mcV$sp(SimpleTicketTextSimilaritySparkJobSpec.scala:29)
at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1714)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1683)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FlatSpec.org$scalatest$FlatSpecLike$$super$run(FlatSpec.scala:1683)
at org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1760)
at org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1760)
at org.scalatest.SuperEngine.runImpl(Engine.scala
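One hedged debugging step that sometimes helps with this kind of servlet-api clash (it is not something suggested in the thread): check at runtime which jar actually provides javax.servlet.Servlet, to verify whether the exclusion rules took effect on the test classpath:

    // Print the jar that the javax.servlet.Servlet class was loaded from.
    val src = classOf[javax.servlet.Servlet].getProtectionDomain.getCodeSource
    println(if (src != null) src.getLocation else "bootstrap classpath")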