Regarding master node failure

2015-07-07 Thread swetha
Hi, What happens if the master node fails in the case of Spark Streaming? Would the data be lost? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regarding-master-node-failure-tp23701.html Sent from the Apache Spark User List mailing list

How to set the number of executors and tasks in a Spark Streaming job in Mesos

2015-08-19 Thread swetha
Hi, How to set the number of executors and tasks in a Spark Streaming job in Mesos? I have the following settings but my job still shows me 11 active tasks and 11 executors. Any idea as to why this is happening? sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "128")
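
A minimal sketch of those settings with the quoting SparkConf requires (app name is illustrative); note that the number of active tasks in a stage follows the number of partitions of the data, not these settings alone:

```scala
import org.apache.spark.SparkConf

// Hedged sketch: SparkConf keys and values are strings, so they must be quoted.
// In coarse-grained mode, spark.cores.max caps the total cores Spark acquires
// from Mesos; per-stage task counts still follow the RDD's partition count.
val sparkConf = new SparkConf()
  .setAppName("streaming-job")        // illustrative
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "128")
```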

Failed to fetch block error

2015-08-18 Thread swetha
Hi, I see the following error in my Spark Job even after using like 100 cores and 16G memory. Did any of you experience the same problem earlier? 15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block input-0-1439959114400, and will not retry (0 retries)

Re: Spark Streaming Json file groupby function

2015-07-28 Thread swetha
the performance be impacted? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Json-file-groupby-function-tp9618p24041.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Multiple UpdateStateByKey Functions in the same job?

2015-08-03 Thread swetha
keys with different return values? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-UpdateStateByKey-Functions-in-the-same-job-tp24119.html Sent from the Apache Spark User List mailing list archive at Nabble.com
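
Nothing prevents several updateStateByKey calls in one job, each with its own state and return type; a hedged sketch (the two input DStreams are illustrative):

```scala
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: two independent updateStateByKey calls in the same job, over
// different keyed streams and with different state/return types.
def wireState(clicks: DStream[(String, Long)],
              durations: DStream[(String, Double)]): Unit = {
  val clickCounts: DStream[(String, Long)] =
    clicks.updateStateByKey[Long]((vs: Seq[Long], st: Option[Long]) =>
      Some(st.getOrElse(0L) + vs.sum))          // running count per key

  val maxDuration: DStream[(String, Double)] =
    durations.updateStateByKey[Double]((vs: Seq[Double], st: Option[Double]) =>
      Some((vs ++ st).foldLeft(0.0)(math.max))) // running max per key

  clickCounts.print()
  maxDuration.print()
}
```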

How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread swetha
, classOf[LongWritable], classOf[Text]). map{case (x, y) => (x.toString, y.toString)} Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-multiple-sequence-files-from-HDFS-to-a-Spark-Context-to-do-Batch-processing-tp24102.html
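
For the multiple-files part of the question, sequenceFile (like the other Hadoop input methods) accepts comma-separated paths and globs; a hedged sketch with illustrative paths:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hedged sketch: several HDFS sequence files loaded into one RDD via a
// comma-separated path list (globs such as part-* also work).
def loadAll(sc: SparkContext): RDD[(String, String)] = {
  val paths = "hdfs:///data/2015-07-30/part-*,hdfs:///data/2015-07-31/part-*"
  sc.sequenceFile(paths, classOf[LongWritable], classOf[Text])
    .map { case (k, v) => (k.toString, v.toString) }
}
```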

What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread swetha
Hi, What is the optimal approach to do Secondary sort in Spark? I have to first Sort by an Id in the key and further sort it by timeStamp which is present in the value. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-optimal
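
The usual answer is a composite key plus repartitionAndSortWithinPartitions: partition by the id alone but sort by (id, timestamp), so the shuffle itself delivers each id's records time-ordered. A hedged sketch:

```scala
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Hedged sketch: partition only on the id component of the composite key so
// that all records for an id land together, already sorted by timestamp.
class IdPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (id: String, _) => delegate.getPartition(id)
    case other           => delegate.getPartition(other)
  }
}

def secondarySort(rdd: RDD[(String, (Long, String))]): RDD[((String, Long), String)] =
  rdd.map { case (id, (ts, value)) => ((id, ts), value) }
    .repartitionAndSortWithinPartitions(new IdPartitioner(64))
```

Unlike groupByKey followed by an in-memory sort, this never materializes a whole group at once.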

Json parsing library for Spark Streaming?

2015-07-27 Thread swetha
Scala? val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap). map{case (x, y) = ((x.toString, Utils.toJsonObject(y.toString).get(request).getAsJsonObject().get(queryString).toString))} Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560

How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread swetha
with active sessions are still available for joining with those in the current job. So, what do we need to keep the data in memory in between two batch jobs? Can we use Tachyon? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-RDDs

How to maintain Windows of data along with maintaining session state using UpdateStateByKey

2015-07-24 Thread swetha
Streaming? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-maintain-Windows-of-data-along-with-maintaining-session-state-using-UpdateStateByKey-tp23986.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread swetha
Hi, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD
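
For reference, IndexedRDD is an external AMPLab package rather than part of Spark core (1.4.0 included); a hedged sketch following the method names in the project's README:

```scala
import org.apache.spark.SparkContext
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._  // key serializers

// Hedged sketch: point lookups, functional updates and deletes by key.
def demo(sc: SparkContext): Unit = {
  val pairs   = sc.parallelize((1L to 1000L).map(x => (x, x * 2)))
  val indexed = IndexedRDD(pairs).cache()

  val hit     = indexed.get(42L)            // Option[Long]
  val updated = indexed.put(42L, -1L)       // returns a new IndexedRDD
  val pruned  = updated.delete(Array(7L))   // batch delete by key
}
```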

Re: creating a distributed index

2015-07-14 Thread swetha
Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Sessionization using updateStateByKey

2015-07-14 Thread swetha
have to use any code like ssc.checkpoint(checkpointDir)? Also, how is the performance if I use both DStream Checkpointing for maintaining the state and use Kafka Direct approach for exactly once semantics? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560
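
On the checkpointing question: ssc.checkpoint(checkpointDir) is mandatory for updateStateByKey regardless of the input source; the Kafka direct approach removes the need for a receiver write-ahead log, not for state checkpointing. A hedged recovery sketch (names, paths, and the batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hedged sketch: getOrCreate rebuilds the context from the checkpoint
// directory after a failure.
def createContext(checkpointDir: String): StreamingContext = {
  val conf = new SparkConf().setAppName("sessionization")
  val ssc  = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)   // required by updateStateByKey
  // build the Kafka direct stream + updateStateByKey pipeline here
  ssc
}

val checkpointDir = "hdfs:///checkpoints/sessionization"
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
```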

Distributed caching of a file in Spark Streaming

2015-10-21 Thread swetha
? SparkContext.addFile() SparkFiles.get(fileName) Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distributed-caching-of-a-file-in-SPark-Streaming-tp25157.html Sent from the Apache Spark User List mailing list archive at Nabble.com
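
A hedged sketch of the two calls named in the question: addFile ships the file to every node once per application, and SparkFiles.get resolves the executor-local copy (file path is illustrative):

```scala
import org.apache.spark.{SparkContext, SparkFiles}

// Hedged sketch: distribute a small lookup file to all executors.
def useSharedFile(sc: SparkContext): Unit = {
  sc.addFile("hdfs:///config/lookup.txt")
  sc.parallelize(1 to 100).foreachPartition { _ =>
    val localPath = SparkFiles.get("lookup.txt")  // node-local path
    // open and read localPath here
  }
}
```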

Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha
Hi, We currently use reduceByKey to reduce by a particular metric name in our Streaming/Batch job. It seems to be doing a lot of shuffles and it has an impact on performance. Does using a custom partitioner before calling reduceByKey improve performance? Thanks, Swetha -- View this message
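
reduceByKey already combines map-side, so a custom partitioner mainly pays off when the same partitioner is reused across several wide operations, letting later joins and reductions skip a shuffle. A hedged sketch of the partitioner-aware overload:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hedged sketch: pass the partitioner into reduceByKey and reuse it later.
def sumByMetric(metrics: RDD[(String, Long)]): RDD[(String, Long)] = {
  val part = new HashPartitioner(128)
  // the result carries `part`, so a later join or reduce keyed the same way
  // with the same partitioner avoids another full shuffle
  metrics.reduceByKey(part, _ + _)
}
```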

Unwanted SysOuts in Spark Parquet

2015-11-08 Thread swetha
Hi, I see a lot of unwanted SysOuts when I try to save an RDD as parquet file. Following is the code and SysOuts. Any idea as to how to avoid the unwanted SysOuts? ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport]) AvroParquetOutputFormat.setSchema(job,

Spark IndexedRDD dependency in Maven

2015-11-09 Thread swetha
Hi, What is the appropriate dependency to include for Spark IndexedRDD? I get a compilation error if I include 0.3 as the version as shown below: <groupId>amplab</groupId> <artifactId>spark-indexedrdd</artifactId> <version>0.3</version> Thanks, Swetha -- View this message in context: http://apache

parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-08 Thread swetha
Hi, I see unwanted Warning when I try to save a Parquet file in hdfs in Spark. Please find below the code and the Warning message. Any idea as to how to avoid the unwanted Warning message? activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void], classOf[ActiveSession],

How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha
Hi, How to unpersist a DStream in Spark Streaming? I know that we can persist using dStream.persist() or dStream.cache(). But I don't see any method to unpersist. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream

Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread swetha
formats would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html Sent from the Apache Spark User List mailing list archive at Nabble.com

What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha
Hi, What is the efficient way to join two RDDs? Would converting both the RDDs to IndexedRDDs be of any help? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-efficient-way-to-Join-two-RDDs-tp25310.html Sent from the Apache Spark

How to lookup by a key in an RDD

2015-10-31 Thread swetha
Hi, I have a requirement wherein I have to load data from hdfs, build an RDD and then lookup by key to do some updates to the value and then save it back to hdfs. How to lookup for a value using a key in an RDD? Thanks, Swetha -- View this message in context: http://apache-spark-user-list
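
PairRDDFunctions.lookup(key) does exactly this, and it is cheap when the RDD carries a partitioner (only the partition owning the key is computed); a hedged sketch:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hedged sketch: partition once, cache, then do narrow lookups by key.
def lookupByKey(records: RDD[(String, Long)], key: String): Seq[Long] = {
  val byKey = records.partitionBy(new HashPartitioner(64)).cache()
  byKey.lookup(key)   // scans only the partition owning `key`
}
```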

Job spilling to disk and memory in Spark Streaming

2015-10-20 Thread swetha
to do shuffles? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-splling-to-disk-and-memory-in-Spark-Streaming-tp25149.html Sent from the Apache Spark User List mailing list archive at Nabble.com

How to have a single reference of a class in Spark Streaming?

2015-10-16 Thread swetha
(this.trackerClass) Some(newCount) } Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-have-Single-refernce-of-a-class-in-Spark-Streaming-tp25103.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Optimal way to avoid processing null returns in Spark Scala

2015-10-07 Thread swetha
Hi, I have the following functions that I am using for my job in Scala. If you see the getSessionId function I am returning null sometimes. If I return null the only way that I can avoid processing those records is by filtering out null records. I wanted to avoid having another pass for filtering
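
One standard way to avoid both the null and the extra filtering pass is to return Option and use flatMap, which drops the empty cases in the same traversal; a hedged sketch (the parsing logic is illustrative):

```scala
import org.apache.spark.rdd.RDD

// Hedged sketch: None replaces the old null return; flatMap unwraps the
// Some cases and silently discards the None cases in a single pass.
def getSessionId(record: String): Option[String] = {
  val id = record.split('|').headOption.getOrElse("")
  if (id.nonEmpty) Some(id) else None
}

def sessionIds(records: RDD[String]): RDD[String] =
  records.flatMap(getSessionId)
```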

Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread swetha
Hi, I see java.lang.NoClassDefFoundError after changing the Streaming job version to 1.5.2. Any idea as to why this is happening? Following are my dependencies and the error that I get. <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>${sparkVersion}</version>

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2015-09-10 Thread swetha
Hi, How is the ContextCleaner different from spark.cleaner.ttl? Is spark.cleaner.ttl still needed when there is a ContextCleaner in the Streaming job? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-variable-get-cleaned-by-ContextCleaner

Spark streaming job filling a lot of data in local spark nodes

2015-09-28 Thread swetha
Sep 17 18:52 shuffle_23103_6_0.data -rw-r--r-- 1 371932812 Sep 17 18:52 shuffle_23125_6_0.data -rw-r--r-- 1 19857974 Sep 17 18:53 shuffle_23291_19_0.data -rw-r--r-- 1 55342005 Sep 17 18:53 shuffle_23305_8_0.data -rw-r--r-- 1 92920590 Sep 17 18:53 shuffle_23303_4_0.data Thanks, Swetha

Lost leader exception in Kafka Direct for Streaming

2015-09-30 Thread swetha
12. You either provided an invalid fromOffset, or the Kafka topic has been damaged Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-leader-exception-in-Kafka-Direct-for-Streaming-tp24891.html Sent from the Apache Spark User List mailing

Does usage of transform for code reuse between Streaming and Batch jobs affect performance?

2015-10-04 Thread swetha
val groupedSessions = sessions.groupByKey(); val sortedSessions = groupedSessions.mapValues[(List[(Long, String)])](iter => iter.toList.sortBy(_._1)) Does use of transform for code reuse affect groupByKey performance? Thanks, Swetha -- View this message in context: http://apache-spark-use

How to set System environment variables in Spark

2015-09-29 Thread swetha
variable when submitting a job in Spark? -Dcom.w1.p1.config.runOnEnv=dev Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-System-environment-variables-in-Spark-tp24875.html Sent from the Apache Spark User List mailing list archive
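
A hedged sketch of how -D system properties are usually threaded through to the driver and executor JVMs (in client mode the driver flag must instead be passed as --driver-java-options on spark-submit):

```scala
import org.apache.spark.SparkConf

// Hedged sketch: system properties travel via the extraJavaOptions settings;
// true environment variables would go through spark.executorEnv.* instead.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions",   "-Dcom.w1.p1.config.runOnEnv=dev")
  .set("spark.executor.extraJavaOptions", "-Dcom.w1.p1.config.runOnEnv=dev")
```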

reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread swetha
Hi, How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of keys for which I need to do sum and average inside the updateStateByKey by joining with old state. How do I accomplish that? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560
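
Rather than nesting reduceByKey inside the update function, the usual pattern keeps (sum, count) as the state and folds each batch's values into it, so the average falls out of the state itself; a hedged sketch:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: accumulate (sum, count) per key; average = sum / count.
def sumAndCount(values: DStream[(String, Double)]): DStream[(String, (Double, Long))] =
  values.updateStateByKey[(Double, Long)] {
    (vs: Seq[Double], st: Option[(Double, Long)]) =>
      val (sum, cnt) = st.getOrElse((0.0, 0L))
      Some((sum + vs.sum, cnt + vs.size))
  }
```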

How to make Group By/reduceByKey more efficient?

2015-09-22 Thread swetha
Hi, How to make Group By more efficient? Is it recommended to use a custom partitioner and then do a Group By? And can we use a custom partitioner and then use a reduceByKey for optimization? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3

Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread swetha
Hi, When I try to recover my Spark Streaming job from a checkpoint directory, I get a StackOverflowError as shown below. Any idea as to why this is happening? 15/09/18 09:02:20 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.lang.StackOverflowError

How to obtain the key in updateStateByKey

2015-09-23 Thread swetha
Hi, How to obtain the current key in updateStateByKey? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-obtain-the-key-in-updateStateByKey-tp24792.html Sent from the Apache Spark User List mailing list archive at Nabble.com
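
The simple per-key update function never sees its key, but the iterator-based overload of updateStateByKey does expose it; a hedged sketch:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: the Iterator[(key, newValues, oldState)] overload makes the
// key available inside the update logic.
def withKeys(pairs: DStream[(String, Long)]): DStream[(String, Long)] = {
  val updateFunc = (it: Iterator[(String, Seq[Long], Option[Long])]) =>
    it.map { case (key, vs, st) =>
      (key, st.getOrElse(0L) + vs.sum)   // `key` is usable here
    }
  pairs.updateStateByKey(updateFunc, new HashPartitioner(8), true)
}
```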

How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-17 Thread swetha
Hi, How to return an RDD of key/value pairs from an RDD that has foreachPartition applied. I have my code something like the following. It looks like an RDD that has foreachPartition applied can only have Unit as the return type. How do I apply foreachPartition, do a save, and at the same time return a pair
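
Since foreachPartition is an action that returns Unit, the usual workaround is mapPartitions: perform the per-partition save as a side effect and emit the pairs from the same pass (mapPartitions is lazy, so a downstream action must still run it). A hedged sketch:

```scala
import org.apache.spark.rdd.RDD

// Hedged sketch: save per partition, then return the pairs for further use.
def saveAndReturnPairs(data: RDD[String]): RDD[(String, Int)] =
  data.mapPartitions { iter =>
    val pairs = iter.map(s => (s, s.length)).toList
    // write `pairs` to the external store here (side effect per partition)
    pairs.iterator
  }
```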

How to clear the temp files that get created by shuffle in Spark Streaming

2015-11-18 Thread swetha
Hi, We have a lot of temp files that get created due to shuffles caused by group by. How to clear the files that get created due to intermediate operations in group by? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-clear

FastUtil DataStructures in Spark

2015-11-19 Thread swetha
Hi, Has anybody used FastUtil equivalent to HashSet for Strings in Spark? Any example would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FastUtil-DataStructures-in-Spark-tp25429.html Sent from the Apache Spark User List
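
A hedged sketch using fastutil's object-keyed open hash set (class name from the fastutil library; it behaves like java.util.HashSet with lower overhead):

```scala
import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet

// Hedged sketch: a fastutil set of session ids, usable inside mapPartitions.
val seen = new ObjectOpenHashSet[String]()
seen.add("session-42")
val alreadySeen = seen.contains("session-42")   // true
```

If such a set crosses a shuffle or is broadcast, it must be serializable; fastutil collections are, though registering them with Kryo is worth considering.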

How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha
Hi, We see a bunch of issues like the following in our Spark Kafka Direct. Any idea as to how to make Kafka Direct consumers show up in Kafka consumer reporting to debug this issue? Job aborted due to stage failure: Task 47 in stage 336.0 failed 4 times, most recent failure: Lost task 47.3 in

Spark Kafka Direct Error

2015-11-23 Thread swetha
): java.lang.AssertionError: assertion failed: Ran out of messages before reaching ending offset 221572238 for topic hubble_stream partition 88 start 221563725. This should not happen, and indicates that messages may have been lost Thanks, Swetha -- View this message in context: http://apache

Re: Secondary Sorting in Spark

2015-10-25 Thread swetha kasireddy
Hi, Does the use of custom partitioner in Streaming affect performance? On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase wrote: > Great article, especially the use of a custom partitioner. > > Also, sorting by multiple fields by creating a tuple out of them is an > awesome,

Re: Secondary Sorting in Spark

2015-10-26 Thread swetha kasireddy
ow it could affect performance. > > Used correctly it should improve performance as you can better control > placement of data and avoid shuffling… > > -adrian > > From: swetha kasireddy > Date: Monday, October 26, 2015 at 6:56 AM > To: Adrian Tanase > Cc: Bill Bejeck, "us

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
me other scheme other than hash-based) then you need to > implement a custom partitioner. It can be used to improve data skews, etc. > which ultimately improves performance. > > On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
unt memory allocated for shuffles by changing > the configuration spark.shuffle.memoryFraction . More fraction would cause > less spilling. > > > On Tue, Oct 27, 2015 at 6:35 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> So, Wouldn't

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread swetha kasireddy
I am using the following: <groupId>com.twitter</groupId> <artifactId>parquet-avro</artifactId> <version>1.6.0</version> On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com> wrote: > Which Spark version used? > > It was fixed in Parquet-1.7x, so Spark-1.5.x will work. > > > > > > O

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
heckpoint directory is a good way to restart > the streaming job, you should stop the spark context or at the very least > kill the driver process, then restart. > > On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> Hi Cody, &g

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
t;c...@koeninger.org> wrote: > Without knowing more about what's being stored in your checkpoint > directory / what the log output is, it's hard to say. But either way, just > deleting the checkpoint directory probably isn't sufficient to restart the > job... > > On Mon, Nov 9,

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha kasireddy
Other than setting the following. sparkConf.set("spark.streaming.unpersist", "true") sparkConf.set("spark.cleaner.ttl", "7200s") On Wed, Nov 4, 2015 at 5:03 PM, swetha <swethakasire...@gmail.com> wrote: > Hi, > > How to unpersist a DStream

Re: How to unpersist a DStream in Spark Streaming

2015-11-05 Thread swetha kasireddy
It's just in the same thread for a particular RDD; I need to uncache it every 2 minutes to clear out the data that is present in a Map inside it. On Wed, Nov 4, 2015 at 11:54 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote: > Hi Swetha, > > Would you mind elaborating your

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
test/sql-programming-guide.html#parquet-files> > . > > On Thu, Nov 5, 2015 at 12:09 AM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >> What is the efficient approach to save an RDD as a file in HDFS and >> retrieve >> it back? I was thinkin

Re: What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha kasireddy
> On Fri, Nov 6, 2015 at 3:21 PM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >> What is the efficient way to join two RDDs? Would converting both the RDDs >> to IndexedRDDs be of any help? >> >> Thanks, >> Swetha >> >> >

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
't be a problem imho. > in general hdfs is pretty fast, s3 is less so > the issue with storing data is that you will loose your partitioner(even > though rdd has it) at loading moment. There is PR that tries to solve this. > > > On 5 November 2015 at 01:09, swetha <sweth

Re: How to lookup by a key in an RDD

2015-11-02 Thread swetha kasireddy
Hi, Is IndexedRDD released yet? Thanks, Swetha On Sun, Nov 1, 2015 at 1:21 AM, Gylfi <gy...@berkeley.edu> wrote: > Hi. > > You may want to look into Indexed RDDs > https://github.com/amplab/spark-indexedrdd > > Regards, > Gylfi. > > > > > >

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
t[T]]) sc.newAPIHadoopFile( parquetFile, classOf[ParquetInputFormat[T]], classOf[Void], tag.runtimeClass.asInstanceOf[Class[T]], jobConf) .map(_._2.asInstanceOf[T]) } On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com> wrote: > No scala. Sup

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
ng from java - toJavaRDD > <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>* > () > > On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> > wrote: > >> How to convert a parquet file that is saved

Re: How to lookup by a key in an RDD

2015-11-05 Thread swetha kasireddy
I read about the IndexedRDD. Is the IndexedRDD join with another RDD that is not an IndexedRDD efficient? On Mon, Nov 2, 2015 at 9:56 PM, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > Swetha > > Currently IndexedRDD is an external library and not part of Spark Core

Re: creating a distributed index

2015-11-06 Thread swetha kasireddy
an IndexedRDD for a certain set of data and then get those keys that are present in the IndexedRDD but not present in some other RDD. How would an IndexedRDD support such a use case in an efficient manner? Thanks, Swetha On Wed, Jul 15, 2015 at 2:46 AM, Jem Tucker <jem.tuc...@gmail.com>

Re: How to put an object in cache for ever in Streaming

2015-10-16 Thread swetha kasireddy
ncache the > previous one, and cache a new one. > > TD > > On Fri, Oct 16, 2015 at 12:02 PM, swetha <swethakasire...@gmail.com> > wrote: > >> Hi, >> >> How to put a changing object in Cache for ever in Streaming. I know that >> we >> can do

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-20 Thread swetha kasireddy
e manually screwed up a topic, or ... ? > > If you really want to just blindly "recover" from this situation (even > though something is probably wrong with your data), the most > straightforward thing to do is monitor and restart your job. > > > > > On W

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread swetha kasireddy
This error I see locally. On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das <t...@databricks.com> wrote: > Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster? > > On Tue, Nov 17, 2015 at 5:34 PM, swetha <swethakasire...@gmail.com> wrote: >

Re: Spark streaming job filling a lot of data in local spark nodes

2015-10-01 Thread swetha kasireddy
emp files. They are not > necessary for checkpointing and only stored in your local temp directory. > They will be stored in "/tmp" by default. You can use `spark.local.dir` to > set the path if you find your "/tmp" doesn't have enough space. > > Best Regards, > Shixiong Zhu >

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Hi Cody, How to look at Option 2 (see the following)? Which portion of the code in Spark Kafka Direct should I look at to handle this issue specific to our requirements? 2. Catch that exception and somehow force things to "reset" for that partition And how would it handle the offsets already calculated

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
at 3:40 PM, Cody Koeninger <c...@koeninger.org> wrote: > KafkaRDD.scala , handleFetchErr > > On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> Hi Cody, >> >> How to look at Option 2(see the following)? Which

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Following is the Option 2 that I was talking about: 2.Catch that exception and somehow force things to "reset" for that partition And how would it handle the offsets already calculated in the backlog (if there is one)? On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy <swethakasire

Re: Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-06 Thread swetha kasireddy
Any documentation/sample code on how to use Ganglia with Spark? On Sat, Dec 5, 2015 at 10:29 PM, manasdebashiskar wrote: > spark has capability to report to ganglia, graphite or jmx. > If none of that works for you you can register your own spark extra > listener > that

Re: Spark metrics for ganglia

2015-12-08 Thread swetha kasireddy
Hi, How to verify whether the GangliaSink directory got created? Thanks, Swetha On Mon, Dec 15, 2014 at 11:29 AM, danilopds <danilob...@gmail.com> wrote: > Thanks tsingfu, > > I used this configuration based in your post: (with ganglia unicast mode) > # Enable GangliaSink

Re: How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread swetha kasireddy
g Ganglia? What is the > command for the same? > > Thanks, > Swetha > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-build-Spark-with-Ganglia-to-enable-monitoring-using-Ganglia-tp25625.html > Sent from the Apache Sp

Re: How to load partial data from HDFS using Spark SQL

2016-01-02 Thread swetha kasireddy
OK. What should the table be? Suppose I have a bunch of parquet files, do I just specify the directory as the table? On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY wrote: > Ok, so whats wrong in using : > > var df=HiveContext.sql("Select * from table where id = ") >

Re: Spark batch getting hung up

2015-12-20 Thread swetha kasireddy
next stage/exits. Basically it happens when it has >> mapPartition/foreachPartition in a stage. Any idea as to why this is >> happening? >> >> Thanks, >> Swetha >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-lis

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread swetha kasireddy
; I am submitting my Spark job with supervise option as shown below. When I >> kill the driver and the app from UI, the driver does not restart >> automatically. This is in a cluster mode. Any suggestion on how to make >> Automatic Driver Restart work would be of great help. >

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-28 Thread swetha kasireddy
code. > > > > > On Wed, Nov 25, 2015 at 12:57 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> I am killing my Streaming job using UI. What error code does UI provide >> if the job is killed from there? >> >> On Wed, Nov 25, 2015 at 11:

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
:31 AM, Cody Koeninger <c...@koeninger.org> wrote: > No, the direct stream only communicates with Kafka brokers, not Zookeeper > directly. It asks the leader for each topicpartition what the highest > available offsets are, using the Kafka offset api. > > On Mon, Nov 23, 201

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
re the partitions it's failing for all on the same leader? > Have there been any leader rebalances? > Do you have enough log retention? > If you log the offset for each message as it's processed, when do you see > the problem? > > On Tue, Nov 24, 2015 at 10:28 AM, swetha kasired

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
s? Can >> someone manually delete folders from the checkpoint folder to help the job >> recover? E.g. Go 2 steps back, hoping that kafka has those offsets. >> >> -adrian >> >> From: swetha kasireddy >> Date: Monday, November 9, 2015 at 10:40 PM >> To: Cody Koeni

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
your situation. > > The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can > try adjusting that to get a longer sleep before retrying the task. > > On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> Hi Cody, &g

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread swetha kasireddy
Looks like I can use mapPartitions, but can it be done using foreachPartition? On Tue, Nov 17, 2015 at 11:51 PM, swetha <swethakasire...@gmail.com> wrote: > Hi, > > How to return an RDD of key/value pairs from an RDD that has > foreachPartition applied. I have my code something

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-18 Thread swetha kasireddy
It works fine after some changes. -Thanks, Swetha On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das <t...@databricks.com> wrote: > Can you verify that the cluster is running the correct version of Spark. > 1.5.2. > > On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy < > s

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-19 Thread swetha kasireddy
That was actually an issue with our Mesos. On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das <t...@databricks.com> wrote: > If possible, could you give us the root cause and solution for future > readers of this thread. > > On Wed, Nov 18, 2015 at 6:37 AM, swetha kasiredd

Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-19 Thread swetha kasireddy
TD's comment at the end. > > Cheers > > On Wed, Nov 18, 2015 at 7:28 PM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >> We have a lot of temp files that gets created due to shuffles caused by >> group by. How to clear the files tha

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
intln(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}") } ... } On Mon, Nov 23, 2015 at 6:31 PM, swetha kasireddy <swethakasire...@gmail.com > wrote: > Also, does Kafka direct query the offsets from the zookeeper directly? > From where does it get the offset
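
A hedged reconstruction of the pattern the snippet above truncates (ssc, kafkaParams, and topics are assumed in scope): the direct stream does not register itself in ZooKeeper like a regular consumer group, so offsets are typically logged straight from the stream:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Hedged sketch: log the offset range each batch actually covers.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
stream.foreachRDD { rdd =>
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
```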

Re: Spark Kafka Direct Error

2015-11-23 Thread swetha kasireddy
led, the kafka leader > reported the ending offset was 221572238, but during processing, kafka > stopped returning messages before reaching that ending offset. > > That probably means something got screwed up with Kafka - e.g. you lost a > leader and lost messages in the proces

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
you mean by kafka consumer reporting? > > I'd log the offsets in your spark job and try running > > kafka-simple-consumer-shell.sh --partition $yourbadpartition > --print-offsets > > at the same time your spark job is running > > On Mon, Nov 23, 2015 at 7:37 PM, swet

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
Also, does Kafka direct query the offsets from the zookeeper directly? From where does it get the offsets? There is data in those offsets, but somehow Kafka Direct does not seem to pick it up? On Mon, Nov 23, 2015 at 6:18 PM, swetha kasireddy <swethakasire...@gmail.com > wrote: > I mea

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
No, I am reading the data from hdfs, transforming it, registering the data in a temp table using registerTempTable and then doing insert overwrite using Spark SQL's hiveContext. On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh wrote: > how are you doing the insert?

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
400 cores are assigned to this job. On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch wrote: > How many workers (/cpu cores) are assigned to this job? > > 2016-06-09 13:01 GMT-07:00 SRK : > >> Hi, >> >> How to insert data into 2000

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
erRecord, ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner """.stripMargin) On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > Hi Bijay, > > If I am hitting this issue, > https://issues.apache.or

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi Bijay, If I am hitting this issue, https://issues.apache.org/jira/browse/HIVE-11940. What needs to be done? Incrementing to higher version of hive is the only solution? Thanks! On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > Hi, >

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
e. >> >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >>

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread swetha kasireddy
Hi Bijay, This approach might not work for me as I have to do partial inserts/overwrites in a given table and data_frame.write.partitionBy will overwrite the entire table. Thanks, Swetha On Mon, Jun 13, 2016 at 9:25 PM, Bijay Pathak <bijay.pat...@cloudwick.com> wrote: > Hi Swetha

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-15 Thread swetha kasireddy
Hi Mich, No I have not tried that. My requirement is to insert that from an hourly Spark Batch job. How is it different by trying to insert with Hive CLI or beeline? Thanks, Swetha On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh <mich.talebza...@gmail.com > wrote: > Hi Swetha, >

Re: Kryo ClassCastException during Serialization/deserialization in Spark Streaming

2016-06-23 Thread swetha kasireddy
sampleMap is populated from inside a method that is getting called from updateStateByKey On Thu, Jun 23, 2016 at 1:13 PM, Ted Yu wrote: > Can you illustrate how sampleMap is populated ? > > Thanks > > On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote:

Re: Driver not able to restart the job automatically after the application of Streaming with Kafka Direct went down

2016-02-05 Thread swetha kasireddy
uding > the --supervise option? > > > Thanks, > Swetha > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Driver-not-able-to-restart-the-job-automatically-after-the-application-of-Streaming-with-Kafka-Direcn-tp26155.h

Help needed in deleting a message posted in Spark User List

2016-02-05 Thread swetha kasireddy
Hi, I want to edit/delete a message posted in the Spark User List. How do I do that? Thanks!

Re: How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread swetha kasireddy
It seems to be failing when I do something like following in both sqlContext and hiveContext sqlContext.sql("SELECT ssd.savedDate from saveSessionDatesRecs ssd where ssd.partitioner in (SELECT sr1.partitioner from sparkSessionRecords1 sr1))") On Tue, Feb 23, 2016 at 5:57 PM, swetha

Re: How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread swetha kasireddy
These tables are stored in hdfs as parquet. Can sqlContext be applied for the subQueries? On Tue, Feb 23, 2016 at 5:31 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Assuming these are all in Hive, you can either use spark-sql or > spark-shell. > > HiveContext has

Re: How to join an RDD with a hive table?

2016-02-15 Thread swetha kasireddy
OK. would it only query for the records that I want in hive as per filter or just load the entire table? My user table will have millions of records and I do not want to cause OOM errors by loading the entire table in memory. On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh

Re: How to join an RDD with a hive table?

2016-02-15 Thread swetha kasireddy
ENDAR_MONTH_DESC, t_c.CHANNEL_DESC > > ) rs > > LIMIT 10 > > > > [1998-01,Direct Sales,1823005210] > > [1998-01,Internet,248172522] > > [1998-01,Partners,474646900] > > [1998-02,Direct Sales,1819659036] > > [1998-02,Internet,298586496] > > [1998

Re: How to join an RDD with a hive table?

2016-02-16 Thread swetha kasireddy
How to use a custom partitioner hashed by userId inside saveAsTable using a dataframe? On Mon, Feb 15, 2016 at 11:24 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > How about saving the dataframe as a table partitioned by userId? My User > records have userId, number of ses

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
a saveAsTable in a dataframe. >> >> >> Thanks, >> Swetha >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-d

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
of a number of small files and also to be able to scan faster. Something like ...df.write.format("parquet").partitionBy("userIdHash", "userId").mode(SaveMode.Append).save("userRecords"); On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <swethakasire
