Serializable error in Apache Spark job
I am encountering the error below. Can somebody guide me? Something similar is on this link: https://github.com/elastic/elasticsearch-hadoop/issues/298

java.io.NotSerializableException: actor.MentionCrawlActor
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) ~[na:1.7.0_79]
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) ~[na:1.7.0_79]
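This exception usually means a Spark closure captured a non-serializable object (here the actor) from the driver. A minimal sketch of the common workaround, with hypothetical names standing in for the poster's classes:

import org.apache.spark.rdd.RDD

// Stand-ins: MentionCrawlActor mirrors the non-serializable class from the
// stack trace; Mention is an assumed record type.
class MentionCrawlActor {
  def crawl(text: String): Unit = println(text)
}
case class Mention(text: String)

object CrawlJob {
  def process(mentions: RDD[Mention]): Unit = {
    mentions.foreachPartition { part =>
      // Construct the non-serializable helper on the executor, inside the
      // closure, so Spark never tries to serialize it from the driver.
      val crawler = new MentionCrawlActor
      part.foreach(m => crawler.crawl(m.text))
    }
  }
}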
Spark Streaming Restart at scheduled intervals
Hi All,

I am creating a Spark Twitter streaming connection in my app over a long period of time. When I have new keywords, I need to add them to the streaming connection, which means stopping and starting the current Twitter stream. I have tried Akka actor scheduling but could not achieve this. Does anybody have an idea how to do that?

Regards
Pankaj
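A StreamingContext cannot be restarted once stopped, so one approach is to keep the SparkContext alive and rebuild only the streaming side when the keyword list changes. A minimal sketch under that assumption (batch interval and output are placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object RestartableTwitterStream {
  // Build and start a fresh streaming context for the given keywords.
  def startStream(sc: SparkContext, keywords: Seq[String]): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(10))
    TwitterUtils.createStream(ssc, None, keywords).map(_.getText).print()
    ssc.start()
    ssc
  }

  // When new keywords arrive: stop only the streaming side, then rebuild.
  def restart(sc: SparkContext, old: StreamingContext, keywords: Seq[String]): StreamingContext = {
    old.stop(stopSparkContext = false, stopGracefully = true)
    startStream(sc, keywords)
  }
}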
Out of memory with twitter spark streaming
Hi,

I am running an application using Activator that retrieves tweets and stores them in a MySQL database using the code below. I get an OOM error after 5-6 hours with -Xmx1048M. If I increase the memory, the OOM is only delayed. Can anybody give me a clue?

Here is the code:

var tweetStream = TwitterUtils.createStream(ssc, None, keywords)
var tweets = tweetStream.map(tweet => {
  var user = tweet.getUser
  var replyStatusId = tweet.getInReplyToStatusId
  var reTweetStatus = tweet.getRetweetedStatus
  var pTweetId = -1L
  var pcreatedAt = 0L
  if (reTweetStatus != null) {
    pTweetId = reTweetStatus.getId
    pcreatedAt = reTweetStatus.getCreatedAt.getTime
  }
  tweet.getCreatedAt.getTime + "|$" + tweet.getId + "|$" + user.getId + "|$" +
    user.getName + "|$" + user.getScreenName + "|$" + user.getDescription + "|$" +
    tweet.getText.trim + "|$" + user.getFollowersCount + "|$" + user.getFriendsCount + "|$" +
    tweet.getGeoLocation + "|$" + user.getLocation + "|$" + user.getBiggerProfileImageURL + "|$" +
    replyStatusId + "|$" + pTweetId + "|$" + pcreatedAt
})

tweets.foreachRDD(tweetsRDD => {
  tweetsRDD.distinct()
  val count = tweetsRDD.count
  println("* %s tweets found on this RDD".format(count))
  if (count > 0) {
    var timeMs = System.currentTimeMillis
    var counter = DBQuery.getProcessedCount()
    var location = "tweets/" + counter + "/"
    tweetsRDD.collect().map(tweet => DBQuery.saveTweets(tweet))
    // tweetsRDD.saveAsTextFile(location + timeMs + ".txt")
    DBQuery.addTweetRDD(counter)
  }
})

// Checkpoint directory to recover from failures
println("tweets for the last stream are saved which can be processed later")
val checkpointDir = "f:/svn1/checkpoint/"
ssc.checkpoint(checkpointDir)
ssc.start()
ssc.awaitTermination()

Regards
Pankaj
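For reference, a sketch of one way to keep each batch off the driver: write from the executors with foreachPartition instead of collect()-ing every RDD back to the driver, where the whole batch lands in the driver heap. This is not the poster's code; DBQuery is the helper referenced above, and a serializable per-record save is assumed.

import org.apache.spark.rdd.RDD

def saveBatch(tweetsRDD: RDD[String]): Unit = {
  tweetsRDD.foreachPartition { tweets =>
    // Each executor saves its own partition; nothing is gathered on the driver.
    tweets.foreach(t => DBQuery.saveTweets(t))
  }
}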
Re: reading a csv dynamically
Yes, I think you need to first create a map that keeps the number of values in every line. Then you can group all the records with the same number of values, so you know how many kinds of arrays you will have.

val dataRDD = sc.textFile("file.csv")
val dataLengthRDD = dataRDD.map(line => (line.split(",").length, line))
val groupedData = dataLengthRDD.groupByKey()

Now you can process groupedData, as each group will hold all the lines of one length. From the docs: groupByKey([numTasks]), when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

I hope this helps.

Regards
Pankaj
Infoshore Software India
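A self-contained version of the same idea, runnable locally (the file name and local master are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object VariableWidthCsv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("variable-width-csv").setMaster("local[*]"))
    // Key every line by its column count, then group lines of equal width.
    val grouped = sc.textFile("file.csv")
      .map(line => (line.split(",").length, line))
      .groupByKey()
    // Each group can now be parsed into arrays of one known length.
    grouped.mapValues(_.map(_.split(","))).collect().foreach {
      case (width, rows) => println(s"$width columns: ${rows.size} rows")
    }
    sc.stop()
  }
}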
Re: Finding most occurrences in a JSON Nested Array
Send me the current code here. I will fix it and send it back to you.
Re: Finding most occurrences in a JSON Nested Array
I just checked the post. Do you still need help? I think getAs[Seq[String]] should do it. If you are still stuck, let me know.
Re: How to compute RDD[(String, Set[String])] that include large Set
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you call counted.collect? If you still face the same issue, please paste the stack trace here.
Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream
Good luck. Let me know if I can assist you further.

Regards
-Pankaj
LinkedIn: https://www.linkedin.com/profile/view?id=171566646
Skype: pankaj.narang
Re: Set EXTRA_JAR environment variable for spark-jobserver
I suggest creating an uber jar instead. Check my thread on the same topic: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html

Regards
-Pankaj
LinkedIn: https://www.linkedin.com/profile/view?id=171566646
Skype: pankaj.narang
Re: Finding most occurrences in a JSON Nested Array
That's great. I did not have access to the developer machine, so I sent you the pseudo code only. Happy to see it's working. If you need any more help related to Spark, let me know anytime.
Re: Spark SQL implementation error
As per our telephone call, here is how we can fetch the count:

val tweetsCount = sql("SELECT COUNT(*) FROM tweets")
println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on this Dataset\n\n")
Re: Finding most occurrences in a JSON Nested Array
If you need more help, let me know.

-Pankaj
LinkedIn: https://www.linkedin.com/profile/view?id=171566646
Skype: pankaj.narang
Re: Finding most occurrences in a JSON Nested Array
Instead of

results.map(row => row(1)).collect

try

var hobbies = results.flatMap(row => row(1))

It will collect all the hobbies into a single flat array. Now

var hbmap = hobbies.map(hobby => (hobby, 1)).reduceByKey((hobcnt1, hobcnt2) => hobcnt1 + hobcnt2)

will aggregate the hobbies, as in:

("swimming", 2), ("hiking", 1)

Finally,

hbmap.map { case (hobby, count) => (count, hobby) }.sortByKey(ascending = false).collect

will give you the hobbies sorted in descending order by their count. This is pseudo code but should help you.

Regards
Pankaj
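A self-contained version of the pseudo code above, with made-up sample data in place of the poster's JSON-derived rows:

import org.apache.spark.{SparkConf, SparkContext}

object HobbyCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hobby-count").setMaster("local[*]"))
    // Stand-in for the extracted column: one hobby list per record.
    val hobbyLists = sc.parallelize(Seq(Seq("swimming", "hiking"), Seq("swimming")))
    val byCount = hobbyLists
      .flatMap(identity)              // flatten the nested arrays
      .map(hobby => (hobby, 1))
      .reduceByKey(_ + _)             // ("swimming", 2), ("hiking", 1)
      .map { case (hobby, count) => (count, hobby) }
      .sortByKey(ascending = false)   // most frequent hobby first
    byCount.collect().foreach(println)
    sc.stop()
  }
}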
Re: Finding most occurrences in a JSON Nested Array
Yes, row(1).collect would be wrong, as it is not a transformation on an RDD. Try getString(1) to fetch the value. As I already said, this is pseudo code; if it does not help, let me know and I will run the code and send it to you. get/getAs should work for you, for example:

var hashTagsList = popularHashTags.flatMap(x => x.getAs[Seq[String]](0))

If you want, I can even take remote control of your machine to fix it.

Regards
Pankaj
LinkedIn: https://www.linkedin.com/profile/view?id=171566646
Skype: pankaj.narang
Re: saveAsTextFile
If you can paste the code here, I can certainly help. Also, please confirm the version of Spark you are using.

Regards
Pankaj
Infoshore Software India
Re: NoClassDefFoundError when trying to run spark application
Did you assemble an uber jar? You can use sbt assembly to build the jar and then run it. That should fix the issue.
Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream
> Like before I get a java.lang.NoClassDefFoundError: akka/stream/FlowMaterializer$

This can be solved using the assembly plugin. You need to enable the assembly plugin in your global plugins directory (C:\Users\infoshore\.sbt\0.13\plugins) by adding a line to plugins.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.0")

Then add the following lines to build.sbt:

import AssemblyKeys._ // put this at the top of the file

seq(assemblySettings: _*)

Also, at the bottom, don't forget to add assemblySettings and the merge strategy:

mergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf")          => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$")      => MergeStrategy.discard
  case "log4j.properties"                                  => MergeStrategy.discard
  case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
  case "reference.conf"                                    => MergeStrategy.concat
  case _                                                   => MergeStrategy.first
}

Now run sbt assembly. That will create a jar which can be run without the --jars option, as it is an uber jar containing all the dependencies.

Also, NoSuchMethodError is thrown when the compiled and runtime versions differ. What version of Spark are you using? You need to use the same version in build.sbt. Here is your build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" // exclude("com.typesafe", "config")
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.1"
libraryDependencies += "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.3"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
libraryDependencies += "org.apache.cassandra" % "cassandra-thrift" % "2.0.5"
libraryDependencies += "joda-time" % "joda-time" % "2.6"

and your error is:

Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J
  at akka.stream.StreamSubscriptionTimeoutSettings$.apply(FlowMaterializer.scala:256)

I think there is a version mismatch in the jars you use at runtime.

If you need more help, add me on Skype: pankaj.narang

-Pankaj
Re: Publishing streaming results to web interface
Thomus,

Spark does not provide a web interface for this directly. There might be third-party apps providing dashboards, but I am not aware of any for this purpose. You can, however, use methods that save the data to the file system instead of printing it to the screen; some of the methods you can use on an RDD are saveAsObjectFile and saveAsTextFile. You can then read these files to show the results in a web interface written in any language of your choice.

Regards
Pankaj
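A minimal sketch of that save-then-serve idea, assuming a DStream of already-formatted result lines (the results/ directory naming is a placeholder):

import org.apache.spark.streaming.dstream.DStream

def publish(results: DStream[String]): Unit = {
  results.foreachRDD { (rdd, time) =>
    // One directory per batch, keyed by batch time, for a web app to pick up.
    if (rdd.count() > 0) rdd.saveAsTextFile(s"results/batch-${time.milliseconds}")
  }
}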
Re: Reading nested JSON data with Spark SQL
Hi,

I am having a similar problem and tried your solution with Spark 1.2 built with Hadoop. I am saving objects to Parquet files where some fields are of type Array. When I fetch them as below, I get:

java.lang.ClassCastException: [B cannot be cast to java.lang.CharSequence

def fetchTags(rows: SchemaRDD) = {
  rows.flatMap(x => x.getAs[Buffer[CharSequence]](0).map(_.toString()))
}

The values I am fetching were stored as an array of strings. I have tried replacing Buffer[CharSequence] with Array[String], Seq[String], and Seq[Seq[Char]], but I still get errors.

Can you provide a clue?

Pankaj
Re: Reading nested JSON data with Spark SQL
It also looks like, when I store strings in Parquet and try to fetch them using Spark code, I get the ClassCastException. Below is how my arrays of strings are saved: each character's ASCII value is present in an array of ints.

res25: Array[Seq[String]] = Array(ArrayBuffer(Array(104, 116, 116, 112, 58, 47, 47, 102, 98, 46, 109, 101, 47, 51, 67, 111, 72, 108, 99, 101, 77, 103)), ArrayBuffer(), ArrayBuffer(), ArrayBuffer(), ArrayBuffer(Array(104, 116, 116, 112, 58, 47, 47, 105, 110, 115, 116, 97, 103, 114, 97, 109, 46, 99, 111, 109, 47, 112, 47, 120, 84, 50, 51, 78, 76, 105, 85, 55, 102, 47)), ArrayBuffer(), ArrayBuffer(Array(104, 116, 116, 112, 58, 47, 47, 105, 110, 115, 116, 97, 103, 114, 97, 109, 46, 99, 111, 109, 47, 112, 47, 120, 84, 50, 53, 72, 52, 111, 90, 95, 114, 47)), ArrayBuffer(Array(104, 116, 116, 112, 58, 47, 47, 101, 122, 101, 101, 99, 108, 97, 115, 115, 105, 102, 105, 101, 100, 97, 100, 115, 46, 99, 111, 109, 47, 47, 100, 101, 115, 99, 47, 106, 97, 105, 112, 117, 114, 47, 49, 48, 51, 54, 50, 50,
Re: Reading nested JSON data with Spark SQL
Oops:

sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

This solved the issue. Important for everyone.
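In context, that looks roughly like the sketch below (the path and table name are placeholders): the flag must be set before the Parquet files are read, so binary columns written as strings come back as strings rather than byte arrays.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val tweets = sqlContext.parquetFile("tweets.parquet")
tweets.registerTempTable("tweets")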