Re: automatic start of streaming job on failure on YARN

2015-10-02 Thread Ashish Rangole
Are you running the job in yarn-cluster mode? On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" wrote: > We've a streaming application running on YARN and we would like to ensure > that it is up and running 24/7. > > Is there a way to tell YARN to automatically restart a specific >
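
For reference: on Spark-on-YARN the number of automatic application retries is governed by spark.yarn.maxAppAttempts, capped by the cluster's yarn.resourcemanager.am.max-attempts. A minimal sketch of a yarn-cluster submission with retries enabled (the class and jar names are placeholders):

    # yarn-cluster mode, so the driver runs inside the YARN application
    # master and is restarted along with it on failure
    spark-submit \
      --master yarn-cluster \
      --conf spark.yarn.maxAppAttempts=4 \
      --class com.example.StreamingApp \
      streaming-app.jar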

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ashish Rangole
I suggest taking a heap dump of the driver process using jmap. Then open that dump in a tool like VisualVM to see which object(s) are taking up heap space. It is easy to do. We did this and found out that in our case it was the data structure that stores info about stages, jobs and tasks. There can
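
A sketch of the jmap route, with the pid and file name as placeholders:

    # Dump live objects from the driver JVM to a binary hprof file, then
    # load it in VisualVM (File > Load) to see which classes dominate the heap
    jmap -dump:live,format=b,file=driver.hprof <driver-pid>

If the stage/job/task bookkeeping turns out to be the culprit, as in the case above, the Spark 1.x settings spark.ui.retainedStages and spark.ui.retainedJobs bound how much of that history the driver keeps.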

Re: Worker Machine running out of disk for Long running Streaming process

2015-08-22 Thread Ashish Rangole
Interesting. TD, can you please throw some light on why this is and point to the relevant code in the Spark repo? It will help in better understanding the things that can affect a long-running streaming job. On Aug 21, 2015 1:44 PM, Tathagata Das t...@databricks.com wrote: Could you periodically
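
A hedged sketch of the Spark 1.x-era knob most often suggested in such disk-fill threads (not necessarily the explanation asked for here): spark.cleaner.ttl periodically forgets metadata older than the given number of seconds, letting stale shuffle files on worker disks be cleaned up.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("LongRunningStream")
      // Forget metadata older than one hour (value in seconds); Spark 1.x
      // only, and risky for jobs that reuse old RDDs or broadcasts
      .set("spark.cleaner.ttl", "3600")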

Re: randomSplit instead of a huge map reduce ?

2015-02-20 Thread Ashish Rangole
Is there a check you can put in place to avoid creating pairs that aren't in your set of 20M pairs? Additionally, once you have your arrays converted to pairs, you can do aggregateByKey with each pair being the key. On Feb 20, 2015 1:57 PM, shlomib shl...@summerhq.com wrote: Hi, I am new to Spark
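
A minimal sketch of that aggregateByKey idea, with knownPairs and pairsRdd as hypothetical inputs:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PairAgg"))
    // knownPairs: Set[(String, String)] of the ~20M valid pairs (hypothetical);
    // broadcast once so every task can cheaply drop pairs outside the set
    val known = sc.broadcast(knownPairs)
    // pairsRdd: RDD[((String, String), Long)] built from the arrays (hypothetical)
    val totals = pairsRdd
      .filter { case (pair, _) => known.value.contains(pair) }
      .aggregateByKey(0L)((acc, v) => acc + v, _ + _)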

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the argument to saveAsTextFile. This argument is treated as a folder in the bucket, and the actual files are created in it with the naming convention part-n, where n is the output partition number. On Mon, Jan 26, 2015 at
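
A tiny sketch of that behavior, with a hypothetical bucket name:

    // "s3n://my-bucket/output" is treated as a folder; one object is written
    // per partition: part-00000 .. part-00003, plus a _SUCCESS marker
    val rdd = sc.parallelize(1 to 100, 4)
    rdd.saveAsTextFile("s3n://my-bucket/output")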

Re: Why so many tasks?

2014-12-16 Thread Ashish Rangole
Take a look at CombineFileInputFormat. Repartition or coalesce could introduce shuffle I/O overhead. On Dec 16, 2014 7:09 AM, bethesda swearinge...@mac.com wrote: Thank you! I had known about the small-files problem in HDFS but didn't realize that it affected sc.textFile().
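
A hedged sketch of the combine-input-format route using Hadoop's CombineTextInputFormat, with a hypothetical input path:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // Pack many small files into fewer, larger splits at read time,
    // instead of one task per file/block as with plain sc.textFile
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, CombineTextInputFormat](
        "hdfs:///data/small-files")
      .map(_._2.toString)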

Re: spark-submit on YARN is slow

2014-12-05 Thread Ashish Rangole
Likely this is not the case here, yet one thing to point out with YARN parameters like --num-executors is that they should be specified *before* the app jar and app args on the spark-submit command line; otherwise the app only gets the default number of containers, which is 2. On Dec 5, 2014 12:22 PM, Sandy
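
A sketch of the ordering pitfall, with a hypothetical class and jar:

    # Wrong: anything after the app jar is passed to the app itself, so
    # --num-executors is silently ignored and YARN gives the default 2
    spark-submit --master yarn-cluster --class com.example.App app.jar --num-executors 10

    # Right: all spark-submit/YARN flags come before the app jar
    spark-submit --master yarn-cluster --num-executors 10 --class com.example.App app.jar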

Re: Loading RDDs in a streaming fashion

2014-12-02 Thread Ashish Rangole
This is a common use case, and it is how the Hadoop APIs for reading data work: they return an Iterator[YourRecord] instead of reading every record in at once. On Dec 1, 2014 9:43 PM, Andy Twigg andy.tw...@gmail.com wrote: You may be able to construct RDDs directly from an iterator - not sure -
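
A small sketch of that lazy-iterator behavior in Spark itself: mapPartitions hands each task an Iterator, and as long as the function returns another iterator instead of materializing a collection, records stream through one at a time. parseRecord and the path are hypothetical:

    // Records are pulled and parsed one by one as downstream stages consume them
    val parsed = sc.textFile("hdfs:///big/input")
      .mapPartitions(iter => iter.map(parseRecord))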

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ashish Rangole
This being a very broad topic, a discussion can quickly get subjective. I'll try not to deviate from my experiences and observations, to keep this thread useful to those looking for answers. I have used Hadoop MR (with Hive, MR Java APIs, Cascading and Scalding) as well as Spark (since v0.6) in

Re: Problem in Spark Streaming

2014-06-10 Thread Ashish Rangole
Have you considered the garbage collection impact and whether it coincides with your latency spikes? You can enable GC logging by changing the Spark configuration for your job. Hi, as I search for the keyword "Total delay" in the console log, the delay keeps increasing. I am not sure what this total delay
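
A sketch of enabling GC logging via Spark configuration (this flag set is a typical JDK 7/8 choice, not the only one):

    import org.apache.spark.SparkConf

    // GC activity goes to the executor stdout/stderr logs; for the driver
    // JVM, pass the same flags at submit time via --driver-java-options,
    // since the driver is already running by the time SparkConf is read
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")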