Re: mapwithstate Hangs with Error cleaning broadcast
You are quite right. I am getting this error while profiling my module to see what the minimum resources are that I can use and still meet my SLA. My point is that if a resource constraint creates this problem, then this issue is just waiting to happen at larger scale (though the probability of it happening will be lower). I hope to get some guidance as to which parameter I can use in order to avoid this issue entirely. I am guessing spark.shuffle.io.preferDirectBufs = false, but I am not sure.

..Manas

On Tue, Mar 15, 2016 at 2:30 PM, Iain Cundy <iain.cu...@amdocs.com> wrote:
> Hi Manas
>
> I saw a very similar problem while using mapWithState: a timeout on
> BlockManager remove, leading to a stall.
>
> In my case it only occurred when there was a big backlog of micro-batches
> combined with a shortage of memory. The adding and removing of blocks
> between new and old tasks was interleaved. I don't really know what caused
> it. Once I fixed the problems that were causing the backlog (in my case,
> state compaction not working with Kryo in 1.6.0, with a Kryo workaround
> rather than a patch), I've never seen it again.
>
> So if you've got a backlog or another issue to fix, maybe you'll get lucky
> too.
>
> Cheers
> Iain
>
> From: manas kar [mailto:poorinsp...@gmail.com]
> Sent: 15 March 2016 14:49
> To: Ted Yu
> Cc: user
> Subject: Re: mapwithstate Hangs with Error cleaning broadcast
>
> I am using Spark 1.6. I am not using any broadcast variable myself; this
> broadcast variable is probably used by the state management of
> mapWithState.
>
> ...Manas
>
> On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Which version of Spark are you using?
>
> Can you show the code snippet w.r.t. the broadcast variable?
>
> Thanks
>
> On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar <poorinsp...@gmail.com>
> wrote:
>
> Hi,
> I have a streaming application that takes data from a Kafka topic and uses
> mapWithState.
> After a couple of hours of smooth running, the application seems to have
> stalled. The batch appears to be stuck after the following error popped up.
> Has anyone seen this error or know what causes it?
>
> 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error cleaning broadcast 7456
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>   at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>   at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
>   at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
>   at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>   at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
>   at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
>   at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
>   at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   ... 12 more
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
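The timeout named in the trace above points at two knobs. A sketch of launcher settings to try; these are tuning assumptions, not a confirmed fix for the underlying stall:

```shell
# Sketch only -- these settings are assumptions, not a confirmed fix.
# spark.rpc.askTimeout is the timeout named in the stack trace;
# spark.cleaner.referenceTracking.blocking=false makes the ContextCleaner
# issue remove requests without waiting on the BlockManager, so one slow
# broadcast removal cannot stall the cleaning thread.
spark-submit \
  --conf spark.rpc.askTimeout=300s \
  --conf spark.cleaner.referenceTracking.blocking=false \
  ... # the rest of your usual arguments
```

Neither setting addresses the backlog itself; they only make the cleaner more tolerant of it.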
How to handle Option[Int] in dataframe
Hi,
I have a case class with many columns that are Option[Int] or Option[Array[Byte]] and such. I would like to save it to a parquet file and later read it back into my case class. I found that when the field is null, reading the value back as Int returns 0. My question: is there a way to get an Option[Int] from a dataframe row, instead of an Int?

...Manas

Some more description:

/* My case class */
case class Student(name: String, age: Option[Int])
val s = Student("Manas", Some(35))
val s1 = Student("Manas1", None)
val student = sc.makeRDD(List(s, s1)).toDF

/* Now writing the dataframe */
student.write.parquet("/tmp/t1")

/* Let's read it back */
val st1 = sqlContext.read.parquet("/tmp/t1")
st1.show
+------+----+
|  name| age|
+------+----+
| Manas|  35|
|Manas1|null|
+------+----+

But now I want to cast my dataframe to a dataframe[Student]. What is the easiest way to do it?

..Manas
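One way to keep the Option (a sketch, assuming Spark 1.6 where the Dataset API is available): read the parquet back and cast with `as[Student]`, which rebuilds the case class with `age: Option[Int]` intact. For a plain DataFrame Row, check for null before reading the Int; `optInt` below is a hypothetical helper, not a Spark API:

```scala
import org.apache.spark.sql.Row

// Sketch, assuming Spark 1.6+: a typed Dataset keeps Option fields as-is.
import sqlContext.implicits._
val students = sqlContext.read.parquet("/tmp/t1").as[Student] // Dataset[Student]

// For a plain DataFrame Row, guard against null explicitly
// (hypothetical helper for illustration):
def optInt(row: Row, field: String): Option[Int] = {
  val i = row.fieldIndex(field)
  if (row.isNullAt(i)) None else Some(row.getInt(i))
}
```

The helper avoids the 0-for-null behaviour of `row.getInt` by consulting `isNullAt` first.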
newAPIHadoopRDD file name
I would like to get the file name along with the associated objects so that I can do further mapping on them. My code below gives me (AvroKey[myObject], NullWritable), but I don't know how to get the file that produced those objects.

sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AvroKeyInputFormat[myObject]],
  classOf[AvroKey[myObject]],
  classOf[NullWritable])

Basically I would like to end up with a tuple of (FileName, (AvroKey[myObject], NullWritable)).

Any help is appreciated.

.Manas
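A sketch of one approach I believe works (assuming the returned RDD really is a NewHadoopRDD underneath, which is the case for newAPIHadoopRDD in Spark 1.x): the developer API `mapPartitionsWithInputSplit` exposes the InputSplit for each partition, and a FileSplit carries the file path.

```scala
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.spark.rdd.NewHadoopRDD

// Sketch: newAPIHadoopRDD on this input format is backed by a NewHadoopRDD,
// whose mapPartitionsWithInputSplit (a DeveloperApi) hands us the split.
val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AvroKeyInputFormat[myObject]],
  classOf[AvroKey[myObject]],
  classOf[NullWritable])

val withFileName = rdd.asInstanceOf[NewHadoopRDD[AvroKey[myObject], NullWritable]]
  .mapPartitionsWithInputSplit({ (split: InputSplit, iter) =>
    // Each partition of a file-based input format corresponds to a FileSplit.
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (k, v) => (file, (k, v)) }
  }, preservesPartitioning = true)
```

This requires a Spark cluster to run; the cast to FileSplit assumes a file-based input format (true for Avro files).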
Re: Spark unit test fails
Bumping this question up. Can anyone point to an example on GitHub?

..Manas

On Fri, Apr 3, 2015 at 9:39 AM, manasdebashiskar <manasdebashis...@gmail.com> wrote:

Hi experts,
I am trying to write unit tests for my spark application, which fails with a javax.servlet.FilterRegistration error. I am using CDH5.3.2 Spark and below is my dependency list.

val spark = "1.2.0-cdh5.3.2"
val esriGeometryAPI = "1.2"
val csvWriter = "1.0.0"
val hadoopClient = "2.3.0"
val scalaTest = "2.2.1"
val jodaTime = "1.6.0"
val scalajHTTP = "1.0.1"
val avro = "1.7.7"
val scopt = "3.2.0"
val config = "1.2.1"
val jobserver = "0.4.1"

val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeIONetty = ExclusionRule(organization = "io.netty")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")
val excludeMortbayJetty = ExclusionRule(organization = "org.mortbay.jetty")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeScalap = ExclusionRule(organization = "org.scala-lang", artifact = "scalap")
val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
val excludeCurator = ExclusionRule(organization = "org.apache.curator")
val excludePowermock = ExclusionRule(organization = "org.powermock")
val excludeFastutil = ExclusionRule(organization = "it.unimi.dsi")
val excludeJruby = ExclusionRule(organization = "org.jruby")
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeJUnit = ExclusionRule(organization = "junit")

I found the link (http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749) talking about this issue and a workaround for it.
But that workaround does not get rid of the problem for me. I am using an SBT build which can't be changed to Maven. What am I missing?

Stack trace:
[info] FiltersRDDSpec:
[info] - Spark Filter *** FAILED ***
[info]   java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package
[info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
[info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
[info]   at java.lang.ClassLoader.defineClass(Unknown Source)
[info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.access$100(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(Unknown Source)

Thanks
Manas Kar

--
View this message in context: Spark unit test fails
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-unit-test-fails-tp22368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark unit test fails
Hi experts,
I am trying to write unit tests for my spark application, which fails with a javax.servlet.FilterRegistration error. I am using CDH5.3.2 Spark and below is my dependency list.

val spark = "1.2.0-cdh5.3.2"
val esriGeometryAPI = "1.2"
val csvWriter = "1.0.0"
val hadoopClient = "2.3.0"
val scalaTest = "2.2.1"
val jodaTime = "1.6.0"
val scalajHTTP = "1.0.1"
val avro = "1.7.7"
val scopt = "3.2.0"
val config = "1.2.1"
val jobserver = "0.4.1"

val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeIONetty = ExclusionRule(organization = "io.netty")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")
val excludeMortbayJetty = ExclusionRule(organization = "org.mortbay.jetty")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeScalap = ExclusionRule(organization = "org.scala-lang", artifact = "scalap")
val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
val excludeCurator = ExclusionRule(organization = "org.apache.curator")
val excludePowermock = ExclusionRule(organization = "org.powermock")
val excludeFastutil = ExclusionRule(organization = "it.unimi.dsi")
val excludeJruby = ExclusionRule(organization = "org.jruby")
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeJUnit = ExclusionRule(organization = "junit")

I found the link (http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749) talking about this issue and a workaround for it. But that workaround does not get rid of the problem for me. I am using an SBT build which can't be changed to Maven. What am I missing?
Stack trace:
[info] FiltersRDDSpec:
[info] - Spark Filter *** FAILED ***
[info]   java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package
[info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
[info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
[info]   at java.lang.ClassLoader.defineClass(Unknown Source)
[info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.access$100(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(Unknown Source)

Thanks
Manas
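A hedged note on this failure: the SecurityException usually means two different copies of the javax.servlet classes (one signed, one not) end up on the test classpath, typically one from hadoop-client and one from a Jetty/Spark dependency. A build.sbt sketch of the usual remedies; these are assumptions to verify against your dependency tree, not a confirmed fix:

```scala
// Sketch: exclude the whole javax.servlet organization from the Hadoop side
// (the excludeServletApi rule above only drops the "servlet-api" artifact).
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0" excludeAll (
  ExclusionRule(organization = "javax.servlet")
)

// Running tests in a forked JVM also sometimes avoids classloader/signer
// clashes with the sbt launcher's own classpath.
fork in Test := true
```

`sbt "show test:dependencyClasspath"` (or the dependency-graph plugin) can confirm which artifacts actually bring in javax.servlet.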
Re: Cannot run spark-shell command not found.
If you are only interested in getting hands-on with Spark, and not in building it against a specific version of Hadoop, use one of the bundled distributions such as Cloudera's. It gives you a very easy way to install and monitor your services (I find installing via Cloudera Manager, http://www.cloudera.com/content/cloudera/en/downloads.html, super easy). They are currently on Spark 1.2.

..Manas

On Mon, Mar 30, 2015 at 1:34 PM, vance46 <wang2...@purdue.edu> wrote:

Hi all,
I'm a newbie trying to set up Spark for my research project on a RedHat system. I've downloaded spark-1.3.0.tgz and untarred it, and installed Python, Java and Scala. I've set JAVA_HOME and SCALA_HOME and then tried "sudo sbt/sbt assembly" according to https://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_CentOS6. It pops up with "sbt command not found". I then tried to start spark-shell directly in ./bin using "sudo ./bin/spark-shell", and still "command not found". I appreciate your help in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-spark-shell-command-not-found-tp22299.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: PairRDD serialization exception
Hi Sean,
Below are the sbt dependencies that I am using. I gave it another try after removing the "provided" keyword, which failed with the same error. What confuses me is that the stack trace appears only after a few of the stages have already run to completion.

object V {
  val spark = "1.2.0-cdh5.3.0"
  val esriGeometryAPI = "1.2"
  val csvWriter = "1.0.0"
  val hadoopClient = "2.5.0"
  val scalaTest = "2.2.1"
  val jodaTime = "1.6.0"
  val scalajHTTP = "1.0.1"
  val avro = "1.7.7"
  val scopt = "3.2.0"
  val breeze = "0.8.1"
  val config = "1.2.1"
}

object Libraries {
  val EEAMessage = "com.waterloopublic" %% "eeaformat" % "1.0-SNAPSHOT"
  val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
  val spark = "org.apache.spark" % "spark-core_2.10" % V.spark % "provided"
  val hadoopClient = "org.apache.hadoop" % "hadoop-client" % V.hadoopClient % "provided"
  val esriGeometryAPI = "com.esri.geometry" % "esri-geometry-api" % V.esriGeometryAPI
  val scalaTest = "org.scalatest" %% "scalatest" % V.scalaTest % "test"
  val csvWriter = "com.github.tototoshi" %% "scala-csv" % V.csvWriter
  val jodaTime = "com.github.nscala-time" %% "nscala-time" % V.jodaTime % "provided"
  val scalajHTTP = "org.scalaj" %% "scalaj-http" % V.scalajHTTP
  val scopt = "com.github.scopt" %% "scopt" % V.scopt
  val breeze = "org.scalanlp" %% "breeze" % V.breeze
  val breezeNatives = "org.scalanlp" %% "breeze-natives" % V.breeze
  val config = "com.typesafe" % "config" % V.config
}

There are only a few more things left to try (like reverting to Spark 1.1) before I run out of ideas completely. Please share your insights.

..Manas

On Wed, Mar 11, 2015 at 9:44 AM, Sean Owen <so...@cloudera.com> wrote:

This usually means you are mixing different versions of code. Here it is complaining about a Spark class. Are you sure you built vs the exact same Spark binaries, and are not including them in your app?

On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar <manasdebashis...@gmail.com> wrote:

(This is a repost. Maybe a simpler subject will fetch more attention among experts.)

Hi,
I have a CDH5.3.2 (Spark 1.2) cluster.
I am getting a "local class incompatible" exception from my spark application during an action. All my classes are case classes (to the best of my knowledge). Appreciate any help.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 346, datanode02): java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc serialVersionUID = 8789839749593513237, local class serialVersionUID = -4145741279224749316
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Thanks
Manas Kar

--
View this message in context: PairRDD serialization exception
Sent from the Apache Spark User List mailing list archive at Nabble.com.
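Building on Sean's point about mixed versions: the dependency list compiles against 1.2.0-cdh5.3.0 while the cluster is CDH 5.3.2, and that mismatch alone can produce differing serialVersionUIDs. A hedged build.sbt sketch; the resolver URL and exact artifact coordinates are assumptions to verify against the Cloudera repository:

```scala
// Sketch: compile against the same Spark build the cluster runs.
// (Assumption: the cluster is CDH 5.3.2, so use the matching cdh5.3.2 artifact.)
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies +=
  "org.apache.spark" % "spark-core_2.10" % "1.2.0-cdh5.3.2" % "provided"
```

Keeping the artifact "provided" ensures the cluster's own Spark jars, not the app's, are used at runtime.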
Re: Joining data using Latitude, Longitude
There are a few techniques currently available. GeoMesa, which uses GeoHash, can also prove useful (https://github.com/locationtech/geomesa). Another potential candidate is https://github.com/Esri/gis-tools-for-hadoop, especially https://github.com/Esri/geometry-api-java for inner customization.

If you want to ask questions like "what is nearby me?", these are the basic steps:
1) Index your geometry data using an R-Tree.
2) Write joiner logic that takes advantage of the index tree to get faster access.

Thanks
Manas

On Wed, Mar 11, 2015 at 5:55 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

Ted Dunning and Ellen Friedman's "Time Series Databases" has a section on this with some approaches to geo-encoding:
https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf

On Tue, Mar 10, 2015 at 3:53 PM, John Meehan <jnmee...@gmail.com> wrote:

There are some techniques you can use if you geohash (http://en.wikipedia.org/wiki/Geohash) the lat-lngs. They will naturally be sorted by proximity (with some edge cases, so watch out). If you go the join route, either by trimming the lat-lngs or geohashing them, you're essentially grouping nearby locations into buckets, but you have to consider the borders of the buckets, since the nearest location may actually be in an adjacent bucket. Here's a paper that discusses an implementation:
http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf

On Mar 9, 2015, at 11:42 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest coordinate. If you are using normal Spark code (by creating a key-pair on lat,lon) you can apply certain logic like trimming the lat,lon etc. If you want more precise computing then you are better off using the haversine formula:
http://www.movable-type.co.uk/scripts/latlong.html
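The haversine formula mentioned above is small enough to sketch directly. This assumes a spherical Earth with radius 6371 km, which is accurate to within roughly 0.5%:

```scala
// Haversine great-circle distance in kilometres, as referenced above.
object Haversine {
  private val EarthRadiusKm = 6371.0

  def distanceKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLon/2)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    EarthRadiusKm * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  }
}

// Example: London (51.5074, -0.1278) to Paris (48.8566, 2.3522),
// roughly 343 km.
val d = Haversine.distanceKm(51.5074, -0.1278, 48.8566, 2.3522)
```

In the bucketed-join approach, a function like this refines the candidates found in a geohash bucket (and its neighbours) down to the true nearest match.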
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc
Hi,
I have a CDH5.3.2 (Spark 1.2) cluster. I am getting a "local class incompatible" exception from my spark application during an action. All my classes are case classes (to the best of my knowledge). Appreciate any help.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 346, datanode02): java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc serialVersionUID = 8789839749593513237, local class serialVersionUID = -4145741279224749316
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Thanks
Manas
Re: RDDs
The above is a great example using threads. Does anyone have an example using a Scala/Akka Future to do the same? I am looking for a similar example that uses a Future and does something if the Future times out.

On Tue, Mar 3, 2015 at 7:00 AM, Kartheek.R <kartheek.m...@gmail.com> wrote:

Hi TD,
"You can always run two jobs on the same cached RDD, and they can run in parallel (assuming you launch the 2 jobs from two different threads)."
Is this a correct way to launch jobs from two different threads?

val threadA = new Thread(new Runnable {
  def run() {
    for (i <- 0 until end) {
      val numAs = logData.filter(line => line.contains("a"))
      println("Lines with a: %s".format(numAs.count))
    }
  }
})
val threadB = new Thread(new Runnable {
  def run() {
    for (i <- 0 until end) {
      val numBs = logData.filter(line => line.contains("b"))
      println("Lines with b: %s".format(numBs.count))
    }
  }
})

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p21892.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
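A minimal sketch of the Future-with-timeout pattern being asked for, using plain scala.concurrent (which is the same Future API Akka uses as of Scala 2.10); the 5-second work and 1-second deadline are illustrative values:

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// A future that takes longer than we are willing to wait.
val slow: Future[Int] = Future { Thread.sleep(5000); 42 }

// Await.result throws TimeoutException when the future misses the deadline,
// which is the hook for "do something if the Future times out".
val result: Option[Int] =
  try Some(Await.result(slow, 1.second))
  catch { case _: TimeoutException => None }
```

In a Spark driver, the same pattern lets a job-launching thread give up, log, or retry when a job runs longer than expected.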
How to debug a Hung task
Hi,
I have a spark application that hangs on a single task (the remaining 200-300 tasks complete in reasonable time). I can see in the thread dump which function gets stuck; however, I don't have a clue as to what input value is causing that behaviour. Logging the inputs before the function executes does not help either, as the actual message gets buried in the logs.

How does one go about debugging such a case? Also, is there a way I can wrap my function inside some sort of timer-based environment, so that if it takes too long it throws a stack trace or some such?

Thanks
Manas
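A sketch of the timer-wrapping idea (`timed` is my own illustrative helper, not a Spark API): wrap the suspect function so that only slow calls log their input, which keeps the interesting message from being buried in the logs. Note this reports after the call returns; diagnosing a call that never returns needs a watchdog thread or a Future with a timeout instead.

```scala
// Hypothetical helper: returns a wrapped function that logs its input
// whenever a single call exceeds warnAfterMs milliseconds.
def timed[A, B](label: String, warnAfterMs: Long)(f: A => B): A => B = { a =>
  val start = System.nanoTime()
  val b = f(a)
  val elapsedMs = (System.nanoTime() - start) / 1000000
  if (elapsedMs > warnAfterMs)
    // Only slow inputs reach the log, so they are easy to find.
    println(s"SLOW [$label]: ${elapsedMs}ms on input: $a")
  b
}

// Usage inside a transformation, e.g. rdd.map(timed("parse", 5000)(parseRecord))
val safeInc = timed[Int, Int]("inc", 5000)(_ + 1)
```

Because the wrapper runs inside the task, the SLOW lines appear in the executor's stderr for the stuck stage, next to the thread dump you are already looking at.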
How to print more lines in spark-shell
Hi experts,
I am using Spark 1.2 from CDH5.3. When I issue commands like myRDD.take(10), the printed result gets truncated after 4-5 records. Is there a way to configure the shell to show more items?

..Manas
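Two sketches, both assumptions to verify on your setup: the truncation is the Scala REPL's own print limit (the scala.repl.maxprintstring property), which can be raised when launching the shell.

```shell
# Sketch: raise the Scala REPL's output limit (default is a few hundred
# characters) when starting the Spark shell.
spark-shell --driver-java-options "-Dscala.repl.maxprintstring=64000"
```

Alternatively, inside the shell, `myRDD.take(10).foreach(println)` prints each record on its own line and sidesteps the REPL's result truncation entirely.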
Master dies after program finishes normally
Hi,
I have a Hidden Markov Model running with 200MB of data. Once the program finishes (i.e. all stages/jobs are done), it hangs for 20 minutes or so before the master dies. The following log appears in the spark master:

2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting down ActorSystem [sparkMaster]
java.lang.OutOfMemoryError: GC overhead limit exceeded
  at scala.collection.immutable.List$.newBuilder(List.scala:396)
  at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
  at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
  at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
  at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
  at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
  at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
  at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
  at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
  at org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
  at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
  at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
  at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
  at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
  at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
  at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
  at org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
  at org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
  at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)

Can anyone help?

..Manas
Re: Master dies after program finishes normally
I have 5 workers, each with 8 GB of executor memory. My driver memory is 8 GB as well. They are all 8-core machines. To answer Imran's question, my configuration is: executor_total_max_heapsize = 18GB. This problem happens at the end of my program; I don't have to run many jobs to see this behaviour, and I can see my output correctly in HDFS. I will give it one more try after increasing the master's memory (from the default of 296 MB to 512 MB).

..manas

On Thu, Feb 12, 2015 at 2:14 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:
> How many nodes do you have in your cluster, how many cores, and what is the size of the memory?

On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar <manasdebashis...@gmail.com> wrote:
> Hi Arush, Mine is CDH 5.3 with Spark 1.2. The only changes to my Spark programs are -Dspark.driver.maxResultSize=3g and -Dspark.akka.frameSize=1000. ..Manas

On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:
> What is your cluster configuration? Did you try looking at the Web UI? There are many tips here: http://spark.apache.org/docs/1.2.0/tuning.html Did you try these?

On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar <manasdebashis...@gmail.com> wrote:
> Hi, I have a Hidden Markov Model running with 200 MB of data. Once the program finishes (i.e. all stages/jobs are done), it hangs for 20 minutes or so before killing the master. In the Spark master, the following log appears.
2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting down ActorSystem [sparkMaster]
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at scala.collection.immutable.List$.newBuilder(List.scala:396)
        at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
        at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
        at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
        at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
        at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
        at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
        at org.json4s.MonadicJValue.org$json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
        at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
        at org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
        at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
        at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
        at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
        at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
        at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
        at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
        at org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
        at org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
        at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)

Can anyone help?
..Manas

--
*Arush Kharbanda* || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
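The trace shows the master dying inside Master.rebuildSparkUI, i.e. while replaying the finished application's event log to rebuild its web UI, which is why the failure only appears after the job completes. Two mitigations worth trying — an untested sketch, and the heap size here is an assumption, not a recommendation:

```shell
# conf/spark-env.sh on the master: give the standalone daemons a larger heap
export SPARK_DAEMON_MEMORY=1g

# Alternatively, avoid the replay entirely (at the cost of the post-mortem
# application UI) in conf/spark-defaults.conf:
#   spark.eventLog.enabled  false
```

With no event log to replay, rebuildSparkUI has nothing to load into the master's heap when the application finishes.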
Spark 1.2 + Avro file does not work in HDP2.2
Hi Experts, I have recently installed HDP 2.2 (which depends on Hadoop 2.6). My Spark 1.2 is built with the hadoop-2.4 profile. My program has the following dependencies:

  val avro = "org.apache.avro" % "avro-mapred" % "1.7.7"
  val spark = "org.apache.spark" % "spark-core_2.10" % "1.2.0" % "provided"

My program to read Avro files fails with the following error. What am I doing wrong?

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
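The IncompatibleClassChangeError on TaskAttemptContext (interface found, class expected) is the classic symptom of mixing hadoop1- and hadoop2-compiled bytecode: avro-mapred 1.7.7 resolves to its Hadoop 1 build by default, while HDP 2.2 is a Hadoop 2 cluster. A possible fix, untested here, is to request avro-mapred's hadoop2 classifier in the build:

```scala
// sbt sketch: pull in the Hadoop 2 build of avro-mapred explicitly.
libraryDependencies ++= Seq(
  "org.apache.avro"  %  "avro-mapred" % "1.7.7" classifier "hadoop2",
  "org.apache.spark" %% "spark-core"  % "1.2.0" % "provided"
)
```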
Asymmetric spark cluster memory utilization
Hi, I have a Spark cluster with 5 machines of 32 GB memory each and 2 machines with 24 GB each. I believe spark.executor.memory assigns the same executor memory to all executors. How can I use 32 GB of memory on the first 5 machines and 24 GB on the other 2? Thanks ..Manas
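In standalone mode, spark.executor.memory is indeed one per-application value, but the memory a worker offers is governed by SPARK_WORKER_MEMORY, which is read from each machine's own conf/spark-env.sh and can therefore differ per node. A sketch, with the exact sizes (leaving headroom for the OS) being assumptions:

```shell
# conf/spark-env.sh on each of the five 32 GB machines:
export SPARK_WORKER_MEMORY=28g

# conf/spark-env.sh on each of the two 24 GB machines:
export SPARK_WORKER_MEMORY=20g
```

Note that executors themselves are still sized uniformly per application, so spark.executor.memory must fit within the smallest worker's allowance.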
How to create Track per vehicle using spark RDD
Hi, I have an RDD containing (Vehicle Number, timestamp, Position). I want the equivalent of a lag function over this RDD, so that I can build a track segment for each vehicle. Any help? PS: I have tried reduceByKey and then splitting the list of positions into consecutive tuples, but it runs out of memory every time because of the volume of data. ...Manas *For some reason I have never got any reply to my emails to the user group. I am hoping to break that trend this time. :)*
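Rather than reduceByKey, which materializes every vehicle's full position list on one node, the per-vehicle "lag" can be computed by sorting each vehicle's pings by timestamp and pairing consecutive ones; on an RDD this combines naturally with partitioning by vehicle (e.g. repartitionAndSortWithinPartitions in Spark 1.2+) so no single key's data has to be collected into one list first. A minimal sketch of the pairing step itself, with hypothetical field names (the poster's actual schema is not shown):

```scala
// Hypothetical record types -- illustrative, not the real schema.
case class Ping(vehicle: String, ts: Long, lat: Double, lon: Double)
case class Segment(vehicle: String, startTs: Long, endTs: Long,
                   fromLat: Double, fromLon: Double,
                   toLat: Double, toLon: Double)

// Core "lag" step: sort one vehicle's pings by time and pair neighbours.
// On Spark, this would run per partition after partitioning by vehicle.
def toSegments(pings: Seq[Ping]): Seq[Segment] =
  pings.sortBy(_.ts).sliding(2).collect {
    case Seq(a, b) => Segment(a.vehicle, a.ts, b.ts, a.lat, a.lon, b.lat, b.lon)
  }.toSeq

val segs = toSegments(Seq(
  Ping("V1", 2L, 1.0, 1.0),
  Ping("V1", 1L, 0.0, 0.0),
  Ping("V1", 3L, 2.0, 2.0)))
// segs holds the two segments (0,0)->(1,1) and (1,1)->(2,2)
```

Because each segment needs only two neighbouring pings at a time, the per-key memory footprint stays constant instead of growing with track length.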
Null values in Date field only when RDD is saved as File.
Hi, I am using a library that parses AIS messages. My code, which follows these simple steps, gives me null values in the Date field:
1) Get the message from a file.
2) Parse the message.
3) Map the message RDD to keep only (Date, SomeInfo).
4) Take the top 100 elements. Result: the Date field appears fine on the screen.
5) Save the tuple RDD (created at step 4) to HDFS using saveAsTextFile. Result: when I check the saved file, all my Date fields are null.
Please guide me in the right direction.
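One thing to note is that take(100) runs on the driver's copy of the results, while saveAsTextFile re-evaluates the same lineage on the executors, so anything non-serializable or non-thread-safe in the parsing path can yield nulls only in the saved output. A speculative sketch (the field names and date format are assumptions, and SimpleDateFormat is named only as one common culprit of exactly this symptom):

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Render each Date to an immutable String eagerly, with the formatter
// created fresh where the work actually runs (SimpleDateFormat is not
// thread-safe, so it must not be shared across tasks). With Spark this
// would be:
//   rdd.mapPartitions { it =>
//     val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
//     it.map { case (d, info) => (fmt.format(d), info) }
//   }
def render(records: Seq[(Date, String)]): Seq[(String, String)] = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  records.map { case (d, info) => (fmt.format(d), info) }
}

val out = render(Seq((new Date(0L), "someInfo")))
```

Converting to an immutable string before the save also guards against parser libraries that reuse mutable Date instances across records.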
Spark Streaming Example with CDH5
Hi Spark gurus, I am trying to compile a Spark Streaming example with CDH5 and am having trouble compiling it. Has anyone created an example Spark Streaming app with CDH5 (preferably Spark 0.9.1) who would be kind enough to share their build.sbt/build.scala file (or point to their example on GitHub)? I know there is a streaming example at https://github.com/apache/spark/tree/master/examples but I am looking for something that runs with CDH5. My build.scala file looks like the one given below.

object Dependency {
  // Versions
  object V {
    val Akka = "2.3.0"
    val scala = "2.10.4"
    val cloudera = "0.9.0-cdh5.0.0"
  }
  val sparkCore = "org.apache.spark" %% "spark-core" % V.cloudera
  val sparkStreaming = "org.apache.spark" %% "spark-streaming" % V.cloudera
}

resolvers ++= Seq(
  "cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "hadoop repo" at "https://repository.cloudera.com/content/repositories/releases/")

I have also attached the complete build.scala file for the sake of completeness. sbt dist gives the following error:

object SecurityManager is not a member of package org.apache.spark
[error] import org.apache.spark.{SparkConf, SecurityManager}

build.scala: http://apache-spark-user-list.1001560.n3.nabble.com/file/n7796/build.scala

I appreciate the great work the Spark community is doing. It is by far the best thing I have worked on. ..Manas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Example-with-CDH5-tp7796.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Shark on cloudera CDH5 error
No replies yet. I guess everyone who had this problem knew the obvious reason the error occurred. It took me some time to figure out the workaround, though. Shark depends on /var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core.jar for client-server communication, whereas CDH5 should rely on hadoop-core-2.3.0-mr1-cdh5.0.0.jar.
1) Grab it from another CDH module (I chose hadoop) and take this jar from its library.
2) Remove the jar in /var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core.
3) Place the jar from step 1 in the hadoop-core folder of step 2.
Hope this saves some time for someone who has a similar problem. ..Manas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Shark-on-cloudera-CDH5-error-tp5226p5374.html
Can I share RDD between a pyspark and spark API
Hi experts. I have some pre-built Python parsers that I plan to use, simply because I don't want to rewrite them in Scala. However, after the data is parsed I would like to take the RDD and use it in a Scala program. (Yes, I like Scala more than Python and am more comfortable in Scala. :) In doing so, I don't want to push the parsed data to disk and then re-read it from the Scala class. Is there a way I can achieve what I want efficiently? ..Manas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-I-share-RDD-between-a-pyspark-and-spark-API-tp5415.html
ETL for postgres to hadoop
Hi All, I have some spatial data on a Postgres machine. I want to be able to move that data to Hadoop and do some geo-processing. I tried using Sqoop to move the data to Hadoop, but it complained about the position data, which it says it can't recognize. Does anyone have any idea as to how to do this easily?

Thanks
Manas Kar
Intermediate Software Developer, Product Development | exactEarth Ltd.
manas@exactearth.com | www.exactearth.com
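Sqoop has no type mapping for PostGIS geometry columns, so one workaround is a free-form query import that serializes the geometry to WKT text with ST_AsText before it leaves Postgres. A hypothetical, untested sketch — the connection string, table, and column names are all illustrative:

```shell
# Cast the geometry to well-known text on the way out of Postgres.
sqoop import \
  --connect jdbc:postgresql://dbhost/spatialdb \
  --username etl \
  --query 'SELECT id, ST_AsText(geom) AS geom_wkt FROM tracks WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /data/tracks
```

The WKT strings can then be re-parsed into geometries on the Hadoop side by whatever geo-processing library is in use.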