The error is in the Spark Standalone Worker: it is hitting an OOM while
launching/running an executor process. Specifically, it runs out of memory
while parsing the Hadoop configuration to figure out the environment and
command line to run:

https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L142-L149

Normally this is not something I would expect to happen, since a Spark
Worker is a very lightweight process. Unless it is accumulating a lot of
state, its heap should stay small, and generating a command-line string is
very unlikely to cause this error unless the application configuration is
gigantic. So while it is possible you simply have very large Hadoop XML
configuration files, it is probably not this specific action that is
OOMing; more likely this is just the straw that broke the camel's back and
the Worker is carrying too much other state.
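
If you want to rule out the "gigantic configuration" case, a quick sanity
check (assuming the usual HADOOP_CONF_DIR layout; adjust the path for your
install) is to look at the config file sizes on a worker host:

ls -lh "${HADOOP_CONF_DIR:-/etc/hadoop/conf}"/*.xml

If those files are only a few KB each, this parse is not the real consumer
of the heap.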

This may not be pathological; it may just be that you are running a lot of
executors, or that the Worker is keeping track of metadata for many started
and finished executors, which is not a big deal in itself.
You could fix this by limiting the amount of metadata preserved after jobs
are run (see the spark.deploy.* options for retaining apps, and the Spark
Worker cleanup settings), or by increasing the Spark Worker's heap
(SPARK_DAEMON_MEMORY).

If I hit this, I would start by bumping the daemon memory.
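
For example, something like this in conf/spark-env.sh on the worker hosts
(just a sketch; the values are placeholders to tune for your cluster, and
the Worker needs a restart to pick them up):

# Heap for the standalone daemons (Master/Worker) themselves; default is 1g
export SPARK_DAEMON_MEMORY=2g

# Keep less finished-executor/driver metadata and clean up old application dirs
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.ui.retainedExecutors=100 \
 -Dspark.worker.ui.retainedDrivers=100"

spark.deploy.retainedApplications / spark.deploy.retainedDrivers are the
analogous knobs on the Master side (set via SPARK_MASTER_OPTS).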

On Fri, May 8, 2020 at 11:59 AM Hrishikesh Mishra <sd.hri...@gmail.com>
wrote:

> We submit the Spark job through the spark-submit command, like the one below.
>
>
> sudo /var/lib/pf-spark/bin/spark-submit \
> --total-executor-cores 30 \
> --driver-cores 2 \
> --class com.hrishikesh.mishra.Main\
> --master spark://XX.XX.XXX.19:6066  \
> --deploy-mode cluster  \
> --supervise
> http://XX.XX.XXX.19:90/jar/fk-runner-framework-1.0-SNAPSHOT.jar
>
>
>
>
> We have a Python HTTP server where we host all the jars.
>
> The user killed the driver driver-20200508153502-1291 and it's visible in the
> log as well, but that is not the problem. The OOM is separate from this.
>
> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
> driver-20200508153502-1291
>
> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>
> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>
> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>
> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
> app-20200508153654-11776 removed, cleanupLocalDirs = true
>
> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
> killed by user
>
> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
> stacktrace] was thrown by a user handler's exceptionCaught() method while
> handling the following exception:*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[dispatcher-event-loop-6,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
> stacktrace] was thrown by a user handler's exceptionCaught() method while
> handling the following exception:*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called
>
> 20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
> /grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922
>
>
> On Fri, May 8, 2020 at 9:27 PM Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> It's been a while since I worked with Spark Standalone, but I'd check the
>> logs of the workers. How do you spark-submit the app?
>>
>> Did you check the /grid/1/spark/work/driver-20200508153502-1291 directory?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Fri, May 8, 2020 at 2:32 PM Hrishikesh Mishra <sd.hri...@gmail.com>
>> wrote:
>>
>>> Thanks Jacek for quick response.
>>> Due to our system constraints, we can't move to Structured Streaming
>>> now. But definitely YARN can be tried out.
>>>
>>> But my problem is that I'm not able to figure out where the issue is: the
>>> Driver, the Executor, or the Worker. Even the exceptions are clueless. Please
>>> see the exception below; I'm unable to spot the cause of the OOM.
>>>
>>> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
>>> driver-20200508153502-1291
>>>
>>> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>>>
>>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>>> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>>>
>>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>>> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>>>
>>> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
>>> app-20200508153654-11776 removed, cleanupLocalDirs = true
>>>
>>> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
>>> killed by user
>>>
>>> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
>>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>>> handling the following exception:*
>>>
>>> *java.lang.OutOfMemoryError: Java heap space*
>>>
>>> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught
>>> exception in thread Thread[dispatcher-event-loop-6,5,main]*
>>>
>>> *java.lang.OutOfMemoryError: Java heap space*
>>>
>>> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
>>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>>> handling the following exception:*
>>>
>>> *java.lang.OutOfMemoryError: Java heap space*
>>>
>>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>>
>>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>>
>>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>>
>>> 20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called
>>>
>>> 20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
>>> /grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922
>>>
>>>
>>>
>>>
>>> On Fri, May 8, 2020 at 5:14 PM Jacek Laskowski <ja...@japila.pl> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sorry for being perhaps too harsh, but when you asked "Am I missing
>>>> something. " and I noticed this "Kafka Direct Stream" and "Spark Standalone
>>>> Cluster. " I immediately thought "Yeah...please upgrade your Spark env to
>>>> use Spark Structured Streaming at the very least and/or use YARN as the
>>>> cluster manager".
>>>>
>>>> Another thought was that the user code (your code) could be leaking
>>>> resources so Spark eventually reports heap-related errors that may not
>>>> necessarily be Spark's.
>>>>
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> ----
>>>> https://about.me/JacekLaskowski
>>>> "The Internals Of" Online Books <https://books.japila.pl/>
>>>> Follow me on https://twitter.com/jaceklaskowski
>>>>
>>>>
>>>>
>>>> On Thu, May 7, 2020 at 1:12 PM Hrishikesh Mishra <sd.hri...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am getting an out of memory error in the worker log for streaming jobs
>>>>> every couple of hours, after which the worker dies. There is no shuffle, no
>>>>> aggregation, and no caching in the job; it's just a transformation.
>>>>> I'm not able to identify where the problem is, the driver or the executor.
>>>>> And why does the worker die after the OOM? The streaming job should be the
>>>>> one that dies. Am I missing something?
>>>>>
>>>>> Driver Memory:  2g
>>>>> Executor memory: 4g
>>>>>
>>>>> Spark Version:  2.4
>>>>> Kafka Direct Stream
>>>>> Spark Standalone Cluster.
>>>>>
>>>>>
>>>>> 20/05/06 12:52:20 INFO SecurityManager: SecurityManager:
>>>>> authentication disabled; ui acls disabled; users  with view permissions:
>>>>> Set(root); groups with view permissions: Set(); users  with modify
>>>>> permissions: Set(root); groups with modify permissions: Set()
>>>>>
>>>>> 20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught
>>>>> exception in thread Thread[ExecutorRunner for
>>>>> app-20200506124717-10226/0,5,main]
>>>>>
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>
>>>>> at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
>>>>>
>>>>> at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
>>>>>
>>>>> at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)
>>>>>
>>>>> at
>>>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
>>>>> Source)
>>>>>
>>>>> at
>>>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>>>>> Source)
>>>>>
>>>>> at
>>>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>>>>> Source)
>>>>>
>>>>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>>>>
>>>>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>>>>
>>>>> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>>>>
>>>>> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>>>
>>>>> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>>>>
>>>>> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>>>>>
>>>>> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)
>>>>>
>>>>> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2468)
>>>>>
>>>>> at
>>>>> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2539)
>>>>>
>>>>> at
>>>>> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
>>>>>
>>>>> at
>>>>> org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
>>>>>
>>>>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
>>>>>
>>>>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
>>>>>
>>>>> at
>>>>> org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)
>>>>>
>>>>> at
>>>>> org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
>>>>>
>>>>> at
>>>>> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)
>>>>>
>>>>> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:114)
>>>>>
>>>>> at
>>>>> org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:149)
>>>>>
>>>>> at
>>>>> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
>>>>>
>>>>> 20/05/06 12:53:38 INFO DriverRunner: Worker shutting down, killing
>>>>> driver driver-20200505181719-1187
>>>>>
>>>>> 20/05/06 12:53:38 INFO DriverRunner: Killing driver process!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Hrishi
>>>>>
>>>>
