Are you running the job in yarn cluster mode?
On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" wrote:
> We've a streaming application running on yarn and we would like to ensure
> that it is up and running 24/7.
>
> Is there a way to tell yarn to automatically restart a specific
>
I suggest taking a heap dump of the driver process using jmap. Then open that
dump in a tool like VisualVM to see which object(s) are taking up heap
space. It is easy to do. We did this and found out that in our case it was
the data structure that stores info about stages, jobs and tasks. There can
Interesting. TD, can you please shed some light on why this is and point
to the relevant code in the Spark repo? It will help in a better understanding
of things that can affect a long-running streaming job.
On Aug 21, 2015 1:44 PM, Tathagata Das t...@databricks.com wrote:
Could you periodically
Is there a check you can put in place to not create pairs that aren't in
your set of 20M pairs? Additionally, once you have your arrays converted to
pairs you can do aggregateByKey with each pair being the key.
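The filter-then-aggregate idea above can be sketched in plain Python, without Spark. This is only an illustration: the helper names, the sample data, and the choice of counting as the aggregation are assumptions, not details from the thread. In Spark the same per-pair key would be handed to aggregateByKey.

```python
from collections import defaultdict
from itertools import combinations


def allowed_pairs(items, allowed):
    """Yield each unordered pair from items that is in the allowed set."""
    for pair in combinations(sorted(set(items)), 2):
        if pair in allowed:  # skip pairs outside the known set early
            yield pair


def count_by_pair(arrays, allowed):
    """Aggregate a count per pair, using the pair itself as the key."""
    counts = defaultdict(int)
    for items in arrays:
        for pair in allowed_pairs(items, allowed):
            counts[pair] += 1
    return dict(counts)


# Hypothetical sample data standing in for the 20M known pairs.
allowed = {("a", "b"), ("b", "c")}
arrays = [["a", "b", "c"], ["b", "a"], ["c", "d"]]
pair_counts = count_by_pair(arrays, allowed)
```

Checking membership before emitting a pair keeps the intermediate data limited to pairs that will actually be aggregated.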
On Feb 20, 2015 1:57 PM, shlomib shl...@summerhq.com wrote:
Hi,
I am new to Spark
By default, the files will be created under the path provided as the
argument for saveAsTextFile. This argument is treated as a folder in the
bucket, and the actual files are created in it with the naming convention
part-n, where n is the index of the output partition.
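As a rough illustration of that layout, the sketch below lists the file names one would expect for a given output path. The bucket name and the helper function are hypothetical, and the five-digit zero padding (part-00000, part-00001, ...) is the form typically produced by Hadoop output formats rather than something stated in the thread.

```python
import posixpath


def part_file_names(output_path, num_partitions):
    """Return the expected part-file paths, one per output partition."""
    return [
        posixpath.join(output_path, "part-%05d" % index)
        for index in range(num_partitions)
    ]


# Hypothetical output location; the path is treated as a folder.
names = part_file_names("s3n://my-bucket/output", 3)
```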
On Mon, Jan 26, 2015 at
Take a look at CombineFileInputFormat. Repartition or coalesce could
introduce shuffle I/O overhead.
On Dec 16, 2014 7:09 AM, bethesda swearinge...@mac.com wrote:
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
This is likely not the case here, but one thing to point out with YARN
parameters like --num-executors is that they should be specified *before*
the app jar and app args on the spark-submit command line; otherwise the app
only gets the default number of containers, which is 2.
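The ordering rule can be made explicit with a small helper that assembles the command line. This is a sketch only: the helper, the class name, the jar name, and the argument values are hypothetical; --class and --num-executors are real spark-submit options.

```python
def build_spark_submit(options, app_jar, app_args):
    """Place spark-submit options before the app jar and its arguments."""
    return ["spark-submit"] + list(options) + [app_jar] + list(app_args)


# Hypothetical application; anything after the jar is passed to the app itself.
cmd = build_spark_submit(
    ["--class", "com.example.MyApp", "--num-executors", "10"],
    "my-app.jar",
    ["input-path"],
)
```

Anything placed after the jar is treated as an application argument, which is why an option put there never reaches spark-submit.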
On Dec 5, 2014 12:22 PM, Sandy
This is a common use case, and this is how the Hadoop APIs for reading data
work: they return an Iterator [Your Record] instead of reading every record
in at once.
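The same pattern can be shown with a plain Python generator, which hands back one record at a time instead of materializing the whole input. The function name and the in-memory sample stream are assumptions for illustration; this is not the Hadoop API itself.

```python
import io


def read_records(stream):
    """Yield one record (line) at a time instead of loading them all."""
    for line in stream:
        yield line.rstrip("\n")


# Hypothetical input; a real job would read from a file or split.
source = io.StringIO("r1\nr2\nr3\n")
records = read_records(source)
first = next(records)  # only the first record has been read so far
rest = list(records)   # the remaining records are pulled on demand
```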
On Dec 1, 2014 9:43 PM, Andy Twigg andy.tw...@gmail.com wrote:
You may be able to construct RDDs directly from an iterator - not sure
-
This being a very broad topic, a discussion can quickly get subjective.
I'll try not to deviate from my experiences and observations to keep this
thread useful to those looking for answers.
I have used Hadoop MR (with Hive, MR Java APIs, Cascading and Scalding) as
well as Spark (since v 0.6) in
Have you considered the garbage collection impact and whether it coincides
with your latency spikes? You can enable GC logging by changing the Spark
configuration for your job.
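One way to do this is through the extraJavaOptions properties, sketched below as a helper that builds --conf arguments for spark-submit. The helper is hypothetical, and the JVM flags shown are the classic ones for Java 8 and earlier (newer JVMs use -Xlog:gc* instead), so treat the exact flags as an assumption about your environment.

```python
# Classic GC logging flags for Java 8 and earlier JVMs (assumed here).
GC_LOGGING_FLAGS = ["-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"]


def gc_logging_conf(flags=GC_LOGGING_FLAGS):
    """Return Spark properties that turn on GC logging for driver and executors."""
    java_opts = " ".join(flags)
    return {
        "spark.driver.extraJavaOptions": java_opts,
        "spark.executor.extraJavaOptions": java_opts,
    }


conf = gc_logging_conf()

# Turn the properties into --conf arguments for spark-submit.
submit_args = []
for key, value in sorted(conf.items()):
    submit_args += ["--conf", "%s=%s" % (key, value)]
```

With logging on, GC pause timestamps in the driver and executor logs can be lined up against the batches that showed latency spikes.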
Hi, when I search for the keyword "Total delay" in the console log, the delay
keeps increasing. I am not sure what does this total delay