Serializable error in Apache Spark job

2015-12-17 Thread Pankaj Narang
I am encountering the error below. Can somebody guide me?

Something similar is on this link:
https://github.com/elastic/elasticsearch-hadoop/issues/298


actor.MentionCrawlActor
java.io.NotSerializableException: actor.MentionCrawlActor
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) ~[na:1.7.0_79]
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) ~[na:1.7.0_79]
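[Editor's note, not part of the original post: this exception typically means a non-serializable object, here the MentionCrawlActor, is captured by a closure that Spark ships to executors. A minimal sketch of one common workaround; all names (ClosureFix, CrawlClient) are hypothetical stand-ins, not from the original job:

import org.apache.spark.{SparkConf, SparkContext}

object ClosureFix {
  // Hypothetical stand-in for a non-serializable resource (an actor, HTTP client, DB session, ...).
  class CrawlClient {
    def process(mention: String): Unit = println(s"crawling $mention")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-fix").setMaster("local[2]"))
    val mentions = sc.parallelize(Seq("@spark", "@akka"))

    // Capturing a driver-side, non-serializable object in the closure is what throws
    // NotSerializableException. Building it inside the task, once per partition, avoids that.
    mentions.foreachPartition { part =>
      val client = new CrawlClient()
      part.foreach(client.process)
    }

    sc.stop()
  }
}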



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/seriazable-error-in-apache-spark-job-tp25732.html



Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Pankaj Narang
Hi All,

I am creating a Spark Twitter streaming connection in my app that runs over a long period
of time. When I have new keywords, I need to add them to the Spark streaming connection,
which means I need to stop and restart the current Twitter streaming connection.

I have tried Akka actor scheduling but could not achieve this.

Does anybody have an idea how to do that?

Regards
Pankaj
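[Editor's note, not part of the original post: one pattern sometimes used for this is to stop the StreamingContext without stopping the underlying SparkContext, then build a new StreamingContext and Twitter stream with the fresh keyword list. A minimal sketch; loadKeywords() and the batch interval are assumptions, and whether a new StreamingContext can reuse the same SparkContext depends on the Spark version:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object RestartableTwitterStream {
  // Hypothetical source of the current keyword list (database, file, REST call, ...).
  def loadKeywords(): Seq[String] = Seq("spark", "akka")

  def startStream(sc: SparkContext): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(10))
    val tweets = TwitterUtils.createStream(ssc, None, loadKeywords())
    tweets.map(_.getText).print()   // placeholder for the real processing
    ssc.start()
    ssc
  }

  // Called on a schedule (e.g. from an Akka scheduler) to pick up fresh keywords.
  def restart(sc: SparkContext, current: StreamingContext): StreamingContext = {
    current.stop(stopSparkContext = false, stopGracefully = true)
    startStream(sc)
  }
}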



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Restart-at-scheduled-intervals-tp24192.html



Out of memory with twitter spark streaming

2015-08-06 Thread Pankaj Narang
Hi 

I am running an application using Activator where I retrieve tweets
and store them to a MySQL database using the code below.

I get an OOM error after 5-6 hours with -Xmx1048M. If I increase the memory, the
OOM is only delayed.

Can anybody give me a clue? Here is the code:

var tweetStream = TwitterUtils.createStream(ssc, None, keywords)
var tweets = tweetStream.map(tweet => {
  var user = tweet.getUser
  var replyStatusId = tweet.getInReplyToStatusId
  var reTweetStatus = tweet.getRetweetedStatus
  var pTweetId = -1L
  var pcreatedAt = 0L
  if (reTweetStatus != null) {
    pTweetId = reTweetStatus.getId
    pcreatedAt = reTweetStatus.getCreatedAt.getTime
  }
  tweet.getCreatedAt.getTime + "|$" + tweet.getId +
    "|$" + user.getId + "|$" + user.getName + "|$" + user.getScreenName + "|$" +
    user.getDescription + "|$" + tweet.getText.trim + "|$" + user.getFollowersCount +
    "|$" + user.getFriendsCount + "|$" + tweet.getGeoLocation + "|$" +
    user.getLocation + "|$" + user.getBiggerProfileImageURL + "|$" +
    replyStatusId + "|$" + pTweetId + "|$" + pcreatedAt
})

tweets.foreachRDD(tweetsRDD => {
  tweetsRDD.distinct()
  val count = tweetsRDD.count
  println("* %s tweets found on this RDD".format(count))
  if (count > 0) {
    var timeMs = System.currentTimeMillis
    var counter = DBQuery.getProcessedCount()
    var location = "tweets/" + counter + "/"
    tweetsRDD.collect().map(tweet => DBQuery.saveTweets(tweet))
    // tweetsRDD.saveAsTextFile(location + timeMs + ".txt")
    DBQuery.addTweetRDD(counter)
  }
})

// Checkpoint directory to recover from failures
println("tweets for the last stream are saved which can be processed later")
val checkpointDir = "f:/svn1/checkpoint/"
ssc.checkpoint(checkpointDir)
ssc.start()
ssc.awaitTermination()


regards
Pankaj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-with-twitter-spark-streaming-tp24162.html



Re: reading a csv dynamically

2015-01-21 Thread Pankaj Narang
Yes, I think you first need to create a map that keeps the number of values in every
line. Then you can group all the records with the same number of values, and you will
know how many types of arrays you have.


val dataRDD = sc.textFile("file.csv")
val dataLengthRDD = dataRDD.map(line => (line.split(",").length, line))
val groupedData = dataLengthRDD.groupByKey()

Now you can process groupedData, as it will have the arrays of length x grouped together
in one RDD.

groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a
dataset of (K, Iterable<V>) pairs.
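[Editor's note, not part of the original reply: a minimal sketch of consuming groupedData, e.g. turning each group of same-length lines into arrays of fields; the printing at the end is purely illustrative.]

// Each key is a field count; each value is the Iterable of raw lines with that many fields.
val parsedByLength = groupedData.map { case (numFields, lines) =>
  (numFields, lines.map(_.split(",")).toArray)   // Array[Array[String]] per line length
}

parsedByLength.collect().foreach { case (numFields, rows) =>
  println(s"$numFields-column rows: ${rows.length}")
}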


I hope this helps

Regards
Pankaj 
Infoshore Software
India




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reading-a-csv-dynamically-tp21304p21307.html



Re: Finding most occurrences in a JSON Nested Array

2015-01-21 Thread Pankaj Narang
Send me the current code here; I will fix it and send it back to you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21295.html



Re: Finding most occurrences in a JSON Nested Array

2015-01-19 Thread Pankaj Narang
I just checked the post. Do you still need help?

I think getAs[Seq[String]] should help.
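[Editor's note, not part of the original reply: a hedged illustration of that call; the name rows and the column index 0 are assumptions, not from the thread.]

// rows: RDD[Row] where column 0 holds the nested array of strings
val items  = rows.flatMap(row => row.getAs[Seq[String]](0))
val counts = items.map((_, 1)).reduceByKey(_ + _)
val top10  = counts.map(_.swap).sortByKey(ascending = false).take(10)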

If you are still stuck, let me know.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21252.html



Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Pankaj Narang
Instead of counted.saveAsText("/path/to/save/dir"), what happens if you call
counted.collect?


If you still face the same issue, please paste the stack trace here.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21250.html



Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Pankaj Narang
Good luck. Let me know if I can assist you further.

Regards
-Pankaj 
Linkedin 
https://www.linkedin.com/profile/view?id=171566646
Skype 
pankaj.narang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-tp20926p20991.html



Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Pankaj Narang
I suggest creating an uber jar instead.

Check my thread on the same topic:

http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html


Regards
-Pankaj 
Linkedin 
https://www.linkedin.com/profile/view?id=171566646
Skype 
pankaj.narang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Set-EXTRA-JAR-environment-variable-for-spark-jobserver-tp20989p20992.html



Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread Pankaj Narang
That's great. I did not have access to the developer machine, so I sent you the
pseudo code only.

Happy to see it's working. If you need any more help related to Spark, let me
know anytime.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20997.html



Re: Spark SQL implementation error

2015-01-06 Thread Pankaj Narang
As per our telephone call, here is how we can fetch the count:

 val tweetsCount = sql("SELECT COUNT(*) FROM tweets")
 println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on this Dataset\n\n")




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-implementation-error-tp20901p21008.html



Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
If you need more help, let me know.

-Pankaj
Linkedin 
https://www.linkedin.com/profile/view?id=171566646
Skype
pankaj.narang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20976.html



Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
Try as below:

results.map(row => row(1)).collect

then try

var hobbies = results.flatMap(row => row(1))

It will put all the hobbies in a simple array now.

val hbmap = hobbies.map(hobby => (hobby, 1)).reduceByKey((hobcnt1, hobcnt2) => hobcnt1 + hobcnt2)

It will aggregate the hobbies as below:

("swimming", 2), ("hiking", 1)


Now

hbmap.map { case (hobby, count) => (count, hobby) }.sortByKey(ascending = false).collect

will give you the hobbies sorted in descending order by their count.

This is pseudo code but should help you.
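[Editor's note, not part of the original reply: the same idea as one self-contained chain; results is assumed to be the rows of the Spark SQL query with the hobbies array in column 1.]

val hobbyCounts = results
  .flatMap(row => row.getAs[Seq[String]](1))     // flatten every hobbies array
  .map(hobby => (hobby, 1))
  .reduceByKey(_ + _)                            // e.g. ("swimming", 2), ("hiking", 1)
  .map { case (hobby, count) => (count, hobby) }
  .sortByKey(ascending = false)                  // most frequent hobby first

hobbyCounts.take(10).foreach { case (count, hobby) => println(s"$hobby: $count") }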

Regards
Pankaj






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20975.html



Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
Yes, row(1).collect would be wrong, as it is not a transformation on an RDD; try
getString(1) to fetch the value.

I already said this is pseudo code. If it does not help, let me know and I will run
the code and send it to you.


get/getAs should work for you, for example:

   var hashTagsList = popularHashTags.flatMap(x => x.getAs[Seq[String]](0))


If you want, I can even take remote control of your machine to fix it.
Regards
Pankaj
Linkedin 
https://www.linkedin.com/profile/view?id=171566646
Skype 
pankaj.narang






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20985.html



Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here, I can certainly help.

Also, confirm the version of Spark you are using.

Regards
Pankaj 
Infoshore Software 
India



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html



Re: NoClassDefFoundError when trying to run spark application

2015-01-02 Thread Pankaj Narang
Do you assemble an uber jar?

You can use sbt assembly to build the jar and then run it. That should fix the
issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoClassDefFoundError-when-trying-to-run-spark-application-tp20707p20944.html



Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-02 Thread Pankaj Narang
Like before I get a java.lang.NoClassDefFoundError:
akka/stream/FlowMaterializer$

This can be solved using the assembly plugin. You need to enable the assembly plugin
in your global plugins directory:

C:\Users\infoshore\.sbt\0.13\plugins

Add a line in plugins.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.0")



 and then add the following lines in build.sbt 

import AssemblyKeys._ // put this at the top of the file

seq(assemblySettings: _*)

Also, at the bottom, don't forget to add:

assemblySettings

mergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf")           => MergeStrategy.discard
  case m if m.toLowerCase.matches("meta-inf.*\\.sf$")       => MergeStrategy.discard
  case "log4j.properties"                                   => MergeStrategy.discard
  case m if m.toLowerCase.startsWith("meta-inf/services/")  => MergeStrategy.filterDistinctLines
  case "reference.conf"                                     => MergeStrategy.concat
  case _                                                    => MergeStrategy.first
}


Now run sbt assembly; that will create a jar which can be run without the --jars
option, as it will be an uber jar containing all dependencies.



Also, NoSuchMethodError is thrown when there is a mismatch between the compile-time
and runtime versions.

What version of Spark are you using? You need to use the same version in build.sbt.
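[Editor's note, not part of the original reply: a minimal sketch of pinning a single Spark version in build.sbt so compile-time and runtime versions cannot drift apart; the version string and the "provided" scope (which keeps Spark itself out of the uber jar when the cluster supplies it) are illustrative, not taken from this thread.]

// One source of truth for the Spark version.
val sparkVersion = "1.1.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)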


Here is your build.sbt


libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
// exclude("com.typesafe", "config")

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.1"

libraryDependencies += "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.3"

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()

libraryDependencies += "org.apache.cassandra" % "cassandra-thrift" % "2.0.5"

libraryDependencies += "joda-time" % "joda-time" % "2.6"


and your error is:

Exception in thread "main" java.lang.NoSuchMethodError:
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J
        at akka.stream.StreamSubscriptionTimeoutSettings$.apply(FlowMaterializer.scala:256)

I think there is a version mismatch in the jars you use at runtime.


If you need more help, add me on Skype: pankaj.narang


---Pankaj





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-tp20926p20950.html



Re: Publishing streaming results to web interface

2015-01-02 Thread Pankaj Narang
Thomus,

Spark does not provide any web interface for this directly. There might be third-party
apps providing dashboards, but I am not aware of any for this purpose.

*You can use certain methods so that this data is saved to the file system instead
of being printed on screen. Some of the methods you can use on an RDD are
saveAsObjectFile and saveAsTextFile.*


Now you can read these files to show them in a web interface in any language
of your choice.
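[Editor's note, not part of the original reply: a minimal sketch of that idea for a streaming job; the stream name wordCounts and the output path are illustrative assumptions.]

// wordCounts: a DStream[(String, Long)] produced earlier in the streaming job (assumed)
wordCounts.foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    // One directory per batch; a separate web app can poll this location and render the newest files.
    rdd.map { case (word, count) => s"$word,$count" }
       .saveAsTextFile(s"/data/dashboard/batch-${time.milliseconds}")
  }
}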

Regards
Pankaj






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Publishing-streaming-results-to-web-interface-tp20948p20949.html



Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
Hi,

I am having a similar problem and tried your solution with the Spark 1.2 build
with Hadoop.

I am saving objects to Parquet files where some fields are of type Array.

When I fetch them as below, I get:

 java.lang.ClassCastException: [B cannot be cast to java.lang.CharSequence



def fetchTags(rows: SchemaRDD) = {
  rows.flatMap(x => x.getAs[Buffer[CharSequence]](0).map(_.toString()))
}



The values I am fetching were stored as an Array of Strings. I have tried replacing
Buffer[CharSequence] with Array[String], Seq[String], and Seq[Seq[Char]], but I still
got errors.

Can you provide a clue?

Pankaj



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20933.html



Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
Also, it looks like when I store the Strings in Parquet and try to fetch them using
Spark code, I get a ClassCastException.


Below is how my array of strings is saved; each character's ASCII value is
present in an array of ints:
res25: Array[Seq[String]] r= Array(ArrayBuffer(Array(104, 116, 116, 112, 58,
47, 47, 102, 98, 46, 109, 101, 47, 51, 67, 111, 72, 108, 99, 101, 77, 103)),
ArrayBuffer(), ArrayBuffer(), ArrayBuffer(), ArrayBuffer(Array(104, 116,
116, 112, 58, 47, 47, 105, 110, 115, 116, 97, 103, 114, 97, 109, 46, 99,
111, 109, 47, 112, 47, 120, 84, 50, 51, 78, 76, 105, 85, 55, 102, 47)),
ArrayBuffer(), ArrayBuffer(Array(104, 116, 116, 112, 58, 47, 47, 105, 110,
115, 116, 97, 103, 114, 97, 109, 46, 99, 111, 109, 47, 112, 47, 120, 84, 50,
53, 72, 52, 111, 90, 95, 114, 47)), ArrayBuffer(Array(104, 116, 116, 112,
58, 47, 47, 101, 122, 101, 101, 99, 108, 97, 115, 115, 105, 102, 105, 101,
100, 97, 100, 115, 46, 99, 111, 109, 47, 47, 100, 101, 115, 99, 47, 106, 97,
105, 112, 117, 114, 47, 49, 48, 51, 54, 50, 50,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20935.html



Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
Oops,

  sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

this solved the issue. Important for everyone.
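[Editor's note, not part of the original post: a minimal sketch of where that setting goes; the Parquet path and column layout are illustrative assumptions.]

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Without this, string columns stored as Parquet binary come back as Array[Byte] ([B),
// which is exactly the ClassCastException seen above.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

val tweets = sqlContext.parquetFile("/data/tweets.parquet")   // illustrative path
val urls   = tweets.flatMap(row => row.getAs[Seq[String]](0).map(_.toString))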



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20936.html