Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread swetha
formats would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html Sent from the Apache Spark User List mailing list archive at Nabble.com
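The question is truncated, but for Spark 1.x two common round-trip options are object files and Hadoop sequence files. A minimal sketch (app name and HDFS paths here are illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-roundtrip"))
val rdd = sc.parallelize(Seq(("k1", 1L), ("k2", 2L)))

// Option 1: Java-serialized object file -- simplest, but tied to the
// classes' serialized form and not readable by other Hadoop tools
rdd.saveAsObjectFile("hdfs:///tmp/metrics-obj")
val fromObj = sc.objectFile[(String, Long)]("hdfs:///tmp/metrics-obj")

// Option 2: Hadoop sequence file -- splittable and interoperable;
// keys and values must be convertible to Writable (String/Long are)
rdd.saveAsSequenceFile("hdfs:///tmp/metrics-seq")
val fromSeq = sc.sequenceFile[String, Long]("hdfs:///tmp/metrics-seq")
```

Sequence files are usually the better default when other Hadoop jobs also need to read the data.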

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha kasireddy
Other than setting the following. sparkConf.set("spark.streaming.unpersist", "true") sparkConf.set("spark.cleaner.ttl", "7200s") On Wed, Nov 4, 2015 at 5:03 PM, swetha <swethakasire...@gmail.com> wrote: > Hi, > > How to unpersist a DStream
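For reference, the two settings quoted in the reply look like this in context (values are the ones from the thread; note that spark.cleaner.ttl is a legacy knob, superseded by the ContextCleaner in later releases):

```scala
val sparkConf = new SparkConf()
  .setAppName("streaming-app")
  // eagerly unpersist RDDs generated by DStreams once they are no
  // longer needed by any pending batch
  .set("spark.streaming.unpersist", "true")
  // periodic metadata/RDD cleanup interval; a legacy setting that can
  // delete data still in use if set too low
  .set("spark.cleaner.ttl", "7200s")
```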

Re: How to lookup by a key in an RDD

2015-11-02 Thread swetha kasireddy
Hi, Is IndexedRDD released yet? Thanks, Swetha On Sun, Nov 1, 2015 at 1:21 AM, Gylfi <gy...@berkeley.edu> wrote: > Hi. > > You may want to look into Indexed RDDs > https://github.com/amplab/spark-indexedrdd > > Regards, > Gylfi. > > > > > >

How to lookup by a key in an RDD

2015-10-31 Thread swetha
Hi, I have a requirement wherein I have to load data from hdfs, build an RDD and then lookup by key to do some updates to the value and then save it back to hdfs. How to lookup for a value using a key in an RDD? Thanks, Swetha -- View this message in context: http://apache-spark-user-list
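A sketch of the load/lookup/update/save cycle described above, using plain pair-RDD operations (the paths and the comma-separated record layout are illustrative assumptions):

```scala
// load from HDFS and build a pair RDD keyed by the first field
val pairs = sc.textFile("hdfs:///input/data.txt").map { line =>
  val parts = line.split(",", 2)
  (parts(0), parts(1))
}
pairs.cache() // worth caching if several lookups hit the same RDD

// lookup() returns all values for one key; with a partitioner set it
// scans only the partition that can contain the key
val values: Seq[String] = pairs.lookup("someKey")

// for many keys, a join against an updates RDD beats per-key lookups
val updates = sc.parallelize(Seq(("someKey", "newValue")))
val merged = pairs.leftOuterJoin(updates).mapValues {
  case (old, maybeNew) => maybeNew.getOrElse(old)
}
merged.map { case (k, v) => s"$k,$v" }.saveAsTextFile("hdfs:///output/data")
```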

Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha
Hi, We currently use reduceByKey to reduce by a particular metric name in our Streaming/Batch job. It seems to be doing a lot of shuffles and it has impact on performance. Does using a custom partitioner before calling reduceByKey improve performance? Thanks, Swetha -- View this message
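As context for the replies below: reduceByKey already combines map-side before the shuffle, so a separate partitionBy step in front of it usually adds a shuffle rather than removing one. Pre-partitioning pays off when the same partitioning is reused across several operations. A sketch (`metrics` is assumed to be an RDD[(String, Long)]):

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)

// reduceByKey accepts a partitioner directly: one shuffle, with
// map-side combining before it
val reduced = metrics.reduceByKey(partitioner, _ + _)

// pre-partitioning is worthwhile only when reused several times:
val prePartitioned = metrics.partitionBy(partitioner).cache()
val sums   = prePartitioned.reduceByKey(_ + _)                 // no extra shuffle
val counts = prePartitioned.mapValues(_ => 1L).reduceByKey(_ + _)
```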

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
me other scheme other than hash-based) then you need to > implement a custom partitioner. It can be used to improve data skews, etc. > which ultimately improves performance. > > On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
unt memory allocated for shuffles by changing > the configuration spark.shuffle.memoryFraction . More fraction would cause > less spilling. > > > On Tue, Oct 27, 2015 at 6:35 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> So, Wouldn't

Re: Secondary Sorting in Spark

2015-10-26 Thread swetha kasireddy
ow it could affect performance. > > Used correctly it should improve performance as you can better control > placement of data and avoid shuffling… > > -adrian > > From: swetha kasireddy > Date: Monday, October 26, 2015 at 6:56 AM > To: Adrian Tanase > Cc: Bill Bejeck, "us

Re: Secondary Sorting in Spark

2015-10-25 Thread swetha kasireddy
Hi, Does the use of custom partitioner in Streaming affect performance? On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase wrote: > Great article, especially the use of a custom partitioner. > > Also, sorting by multiple fields by creating a tuple out of them is an > awesome,

Distributed caching of a file in Spark Streaming

2015-10-21 Thread swetha
? SparkContext.addFile() SparkFiles.get(fileName) Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distributed-caching-of-a-file-in-SPark-Streaming-tp25157.html Sent from the Apache Spark User List mailing list archive at Nabble.com
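The two calls named in the snippet fit together like this: addFile ships the file to every executor once, and SparkFiles.get resolves the local copy inside a task. A sketch (`events` and the HDFS path are illustrative; loading inside mapPartitions avoids re-reading the file per record):

```scala
import org.apache.spark.SparkFiles

// ship the file to every executor at job submission time
sc.addFile("hdfs:///config/lookup-table.txt")

val enriched = events.mapPartitions { iter =>
  // resolve the local path of the shipped copy on this executor
  val localPath = SparkFiles.get("lookup-table.txt")
  val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
  iter.map(e => (e, lookup.contains(e)))
}
```

For data that changes between batches, a broadcast variable rebuilt periodically is the usual alternative.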

Job spilling to disk and memory in Spark Streaming

2015-10-20 Thread swetha
to do shuffles? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-splling-to-disk-and-memory-in-Spark-Streaming-tp25149.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-20 Thread swetha kasireddy
e manually screwed up a topic, or ... ? > > If you really want to just blindly "recover" from this situation (even > though something is probably wrong with your data), the most > straightforward thing to do is monitor and restart your job. > > > > > On W

How to have Single reference of a class in Spark Streaming?

2015-10-16 Thread swetha
(this.trackerClass) Some(newCount) } Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-have-Single-refernce-of-a-class-in-Spark-Streaming-tp25103.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: How to put an object in cache for ever in Streaming

2015-10-16 Thread swetha kasireddy
ncache the > previous one, and cache a new one. > > TD > > On Fri, Oct 16, 2015 at 12:02 PM, swetha <swethakasire...@gmail.com> > wrote: > >> Hi, >> >> How to put a changing object in Cache for ever in Streaming. I know that >> we >> can do

Optimal way to avoid processing null returns in Spark Scala

2015-10-07 Thread swetha
Hi, I have the following functions that I am using for my job in Scala. If you see the getSessionId function I am returning null sometimes. If I return null the only way that I can avoid processing those records is by filtering out null records. I wanted to avoid having another pass for filtering
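One way to avoid the extra filtering pass is to return Option instead of null and let flatMap drop the empties in the same pass. A minimal sketch (the function body and record format are hypothetical stand-ins for the getSessionId described above):

```scala
// Option instead of null: flatMap keeps Some(...) and drops None,
// so no separate filter(_ != null) pass is needed
def getSessionId(record: String): Option[String] =
  if (record.contains("sessionId=")) Some(record.split("sessionId=")(1))
  else None

val records = List("a;sessionId=s1", "garbage", "b;sessionId=s2")
val sessionIds = records.flatMap(getSessionId)
// sessionIds == List("s1", "s2")
```

The same shape works unchanged on an RDD: `rdd.flatMap(getSessionId)`.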

Usage of transform for code reuse between Streaming and Batch job affects the performance ?

2015-10-04 Thread swetha
val groupedSessions = sessions.groupByKey(); val sortedSessions = groupedSessions.mapValues[(List[(Long, String)])](iter => iter.toList.sortBy(_._1)) * Does use of transform for code reuse affect groupByKey performance? Thanks, Swetha -- View this message in context: http://apache-spark-use

Re: Spark streaming job filling a lot of data in local spark nodes

2015-10-01 Thread swetha kasireddy
emp files. They are not > necessary for checkpointing and only stored in your local temp directory. > They will be stored in "/tmp" by default. You can use `spark.local.dir` to > set the path if you find your "/tmp" doesn't have enough space. > > Best Regards, > Shixiong Zhu >

Lost leader exception in Kafka Direct for Streaming

2015-09-30 Thread swetha
12. You either provided an invalid fromOffset, or the Kafka topic has been damaged Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-leader-exception-in-Kafka-Direct-for-Streaming-tp24891.html Sent from the Apache Spark User List mailing

How to set System environment variables in Spark

2015-09-29 Thread swetha
variable when submitting a job in Spark? -Dcom.w1.p1.config.runOnEnv=dev Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-System-environment-variables-in-Spark-tp24875.html Sent from the Apache Spark User List mailing list archive
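A plain -D flag on the submit command is not forwarded to executor JVMs; the usual route is the extraJavaOptions settings (shown below with the property from the question; executorEnv covers true environment variables):

```scala
val sparkConf = new SparkConf()
  // forward a -D system property to the driver and executor JVMs
  .set("spark.driver.extraJavaOptions", "-Dcom.w1.p1.config.runOnEnv=dev")
  .set("spark.executor.extraJavaOptions", "-Dcom.w1.p1.config.runOnEnv=dev")
  // environment variables (as opposed to system properties) use executorEnv
  .setExecutorEnv("RUN_ON_ENV", "dev")
```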

Spark streaming job filling a lot of data in local spark nodes

2015-09-28 Thread swetha
Sep 17 18:52 shuffle_23103_6_0.data -rw-r--r-- 1 371932812 Sep 17 18:52 shuffle_23125_6_0.data -rw-r--r-- 1 19857974 Sep 17 18:53 shuffle_23291_19_0.data -rw-r--r-- 1 55342005 Sep 17 18:53 shuffle_23305_8_0.data -rw-r--r-- 1 92920590 Sep 17 18:53 shuffle_23303_4_0.data Thanks, Swetha

reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread swetha
Hi, How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of keys for which I need to do sum and average inside the updateStateByKey by joining with old state. How do I accomplish that? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560
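The batch's values arrive at the update function already grouped per key, so the "reduceByKey" part is just a fold over them; the average can be derived from a running (sum, count) state rather than stored. A sketch (the state class and DStream name are illustrative):

```scala
// state per key = running (sum, count); average is derived on demand
case class SumCount(sum: Double, count: Long) {
  def avg: Double = if (count == 0) 0.0 else sum / count
}

val updateFunc = (newValues: Seq[Double], state: Option[SumCount]) => {
  val prev = state.getOrElse(SumCount(0.0, 0L))
  // fold this batch's values for the key, then merge with old state
  Some(SumCount(prev.sum + newValues.sum, prev.count + newValues.size))
}

// given pairs: DStream[(String, Double)]
// val running = pairs.updateStateByKey(updateFunc)
```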

How to obtain the key in updateStateByKey

2015-09-23 Thread swetha
Hi, How to obtain the current key in updateStateByKey ? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-obtain-the-key-in-updateStateByKey-tp24792.html Sent from the Apache Spark User List mailing list archive at Nabble.com

How to make Group By/reduceByKey more efficient?

2015-09-22 Thread swetha
Hi, How to make Group By more efficient? Is it recommended to use a custom partitioner and then do a Group By? And can we use a custom partitioner and then use a reduceByKey for optimization? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3

Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread swetha
Hi, When I try to recover my Spark Streaming job from a checkpoint directory, I get a StackOverflowError as shown below. Any idea as to why this is happening? 15/09/18 09:02:20 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.lang.StackOverflowError

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2015-09-10 Thread swetha
Hi, How is the ContextCleaner different from spark.cleaner.ttl? Is spark.cleaner.ttl needed when there is a ContextCleaner in the Streaming job? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-variable-get-cleaned-by-ContextCleaner

How to set the number of executors and tasks in a Spark Streaming job in Mesos

2015-08-19 Thread swetha
Hi, How to set the number of executors and tasks in a Spark Streaming job in Mesos? I have the following settings but my job still shows me 11 active tasks and 11 executors. Any idea as to why this is happening ? sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "128")
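One point worth noting on the settings above: spark.cores.max is an upper bound on total cores, not a target, and the number of tasks per stage is driven by partition count (for Kafka Direct streaming, the number of Kafka partitions), not by cores. A sketch of the relevant knobs:

```scala
val sparkConf = new SparkConf()
  // coarse-grained mode: one long-running Spark executor per Mesos slave
  .set("spark.mesos.coarse", "true")
  // ceiling on total cores across the application, not per executor
  .set("spark.cores.max", "128")
  // default partition count for shuffles, which caps task parallelism
  .set("spark.default.parallelism", "128")
```

To get more tasks, repartition the stream (or add Kafka partitions) rather than raising the core limit.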

Failed to fetch block error

2015-08-18 Thread swetha
Hi, I see the following error in my Spark Job even after using like 100 cores and 16G memory. Did any of you experience the same problem earlier? 15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block input-0-1439959114400, and will not retry (0 retries)

What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread swetha
Hi, What is the optimal approach to do Secondary sort in Spark? I have to first Sort by an Id in the key and further sort it by timeStamp which is present in the value. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-optimal
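The standard pattern for this in Spark 1.2+ is a composite (id, timestamp) key, a partitioner that hashes on id only, and repartitionAndSortWithinPartitions, which sorts during the shuffle instead of materializing whole groups in memory. A sketch (`events` is assumed to be an RDD[((String, Long), String)] keyed by (id, timestamp)):

```scala
import org.apache.spark.{HashPartitioner, Partitioner}

// partition on id only, so all records for one id land together;
// then sort within each partition by the full (id, timestamp) key
class IdPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (id, _) => delegate.getPartition(id)
  }
}

implicit val ordering: Ordering[(String, Long)] = Ordering.Tuple2

val sorted = events.repartitionAndSortWithinPartitions(new IdPartitioner(64))
```

This avoids the memory pressure of groupByKey followed by an in-memory sortBy per group.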

Multiple UpdateStateByKey Functions in the same job?

2015-08-03 Thread swetha
keys with different return values? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-UpdateStateByKey-Functions-in-the-same-job-tp24119.html Sent from the Apache Spark User List mailing list archive at Nabble.com

How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread swetha
, classOf[LongWritable], classOf[Text]). map{case (x, y) => (x.toString, y.toString)} Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-multiple-sequence-files-from-HDFS-to-a-Spark-Context-to-do-Batch-processing-tp24102.html
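sc.sequenceFile takes the same path expressions as other Hadoop inputs, so several files can be loaded into one RDD with a comma-separated list or a glob; no per-file loop or union is required. A sketch (directory names are illustrative):

```scala
import org.apache.hadoop.io.{LongWritable, Text}

// comma-separated paths and globs are both accepted in a single call
val paths = "hdfs:///data/day1,hdfs:///data/day2,hdfs:///data/2015-07-*"
val rdd = sc.sequenceFile(paths, classOf[LongWritable], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }

// alternatively, union separately-loaded RDDs:
// val all = sc.union(Seq(rdd1, rdd2))
```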

Re: Spark Streaming Json file groupby function

2015-07-28 Thread swetha
the performance be impacted? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Json-file-groupby-function-tp9618p24041.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Json parsing library for Spark Streaming?

2015-07-27 Thread swetha
Scala? val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap). map{case (x, y) => ((x.toString, Utils.toJsonObject(y.toString).get("request").getAsJsonObject().get("queryString").toString))} Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560

How to maintain Windows of data along with maintaining session state using UpdateStateByKey

2015-07-24 Thread swetha
Streaming? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-maintain-Windows-of-data-along-with-maintaining-session-state-using-UpdateStateByKey-tp23986.html Sent from the Apache Spark User List mailing list archive at Nabble.com

How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread swetha
with active sessions are still available for joining with those in the current job. So, what do we need to do to keep the data in memory in between two batch jobs? Can we use Tachyon? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs

Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread swetha
Hi, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD

Re: creating a distributed index

2015-07-14 Thread swetha
Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Sessionization using updateStateByKey

2015-07-14 Thread swetha
have to use any code like ssc.checkpoint(checkpointDir)? Also, how is the performance if I use both DStream Checkpointing for maintaining the state and use Kafka Direct approach for exactly once semantics? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560
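On the ssc.checkpoint question: yes, updateStateByKey refuses to start without a checkpoint directory, and driver recovery additionally requires creating the context through getOrCreate. A sketch (the checkpoint path and createContext factory are illustrative names):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  // required for updateStateByKey; also enables driver recovery
  ssc.checkpoint("hdfs:///checkpoints/sessionization")
  // ... build DStreams and the updateStateByKey pipeline here ...
  ssc
}

// on a clean start this calls createContext(); after a crash it
// rebuilds the context and state from the checkpoint instead
val ssc = StreamingContext.getOrCreate(
  "hdfs:///checkpoints/sessionization", () => createContext())
ssc.start()
ssc.awaitTermination()
```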

Regarding master node failure

2015-07-07 Thread swetha
Hi, What happens if the master node fails in the case of Spark Streaming? Would the data be lost? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-master-node-failure-tp23701.html Sent from the Apache Spark User List mailing list
