Apache Spark Log4j logging applicationId

2019-07-23 Thread Luca Borin
Hi,

I would like to add the applicationId to all logs produced by Spark through
Log4j. Consider that I have a cluster with several jobs running in it, so
the presence of the applicationId would be useful to logically divide them.

I have found a partial solution. If I change the conversion pattern of the
PatternLayout, I can include the ThreadContext in every log line, and the
applicationId can be added to it through MDC.
This works for the driver, but I would like to set this information at
Spark application startup, for both the driver and the workers. Notice that I'm
working in a managed environment (Databricks), so I'm partially limited
in cluster management. One workaround to put the parameter into the MDC on
all workers is to broadcast it and perform an action that sets it, but I don't
think that is robust, considering that this should also keep working if a
worker machine restarts or is replaced.
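For reference, a rough sketch of what I mean. On the Log4j side the conversion
pattern gets the MDC key (the appender name below is illustrative):

  log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1} [%X{applicationId}]: %m%n

and on the Spark side (assuming org.slf4j.MDC routes to the active Log4j
backend; the dummy action is only there to run the closure on the executors):

  import org.apache.spark.sql.SparkSession
  import org.slf4j.MDC

  val spark = SparkSession.builder.getOrCreate()
  val appId = spark.sparkContext.applicationId

  // Driver side: tag the driver's log lines with the applicationId.
  MDC.put("applicationId", appId)

  // Executor side: broadcast the id and set it from within a task.
  // Fragile: only the executors (and task threads) that ran this closure
  // are covered, so it does not survive executors being restarted or replaced.
  val appIdBc = spark.sparkContext.broadcast(appId)
  spark.sparkContext.parallelize(1 to 1000, 100).foreachPartition { _ =>
    MDC.put("applicationId", appIdBc.value)
  }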

Thank you


Spark event logging with s3a

2018-11-08 Thread David Hesson
We are trying to use spark event logging with s3a as a destination for event 
data.

We added these settings to the spark submits:

spark.eventLog.dir s3a://ourbucket/sparkHistoryServer/eventLogs
spark.eventLog.enabled true
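
For reference, roughly how these end up on the submit command line (the class
and jar names below are placeholders):

  spark-submit \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=s3a://ourbucket/sparkHistoryServer/eventLogs \
    --class com.example.OurJob \
    our-job.jar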

Everything works fine with smaller jobs, and we can see the history data in the 
history server, which is also using s3a. However, when we tried a job with a few 
hundred gigs of data that goes through multiple stages, it died with an 
OutOfMemoryError (the same job works fine with spark.eventLog.enabled set to false):

18/10/22 23:07:22 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)

Full stack trace: 
https://gist.github.com/davidhesson/bd64a25f04c6bb241ec398f5383d671c

Does anyone have any insight or experience with using the Spark history server with 
s3a? Is this problem perhaps being caused by something else in our configs? Any 
help would be appreciated.


Spark Streaming logging on Yarn : issue with rolling in yarn-client mode for driver log

2018-03-07 Thread chandan prakash
Hi All,
I am running my Spark Streaming job in yarn-client mode.
I want to enable log rolling and aggregation in the node manager container.
I am using the config suggested in the Spark docs:
${spark.yarn.app.container.log.dir}/spark.log in log4j.properties
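
For reference, the relevant part of my executor-side log4j.properties looks
roughly like this (the appender name, sizes and layout are illustrative; only
the file path is the one from the doc):

  log4j.rootLogger=INFO, rolling
  log4j.appender.rolling=org.apache.log4j.RollingFileAppender
  log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
  log4j.appender.rolling.MaxFileSize=50MB
  log4j.appender.rolling.MaxBackupIndex=5
  log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
  log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n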

Also, for aggregation on YARN, I have enabled these properties:
spark.yarn.rolledLog.includePattern=spark*
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds=3600
on the Spark and YARN side respectively.

At executors, my logs are getting rolled up and aggregated after every 1
hour as expected.
*But the issue is:*
for the driver, in yarn-client mode, the value of ${spark.yarn.app.container.log.dir}
is not available when the driver starts, so I am not able to see the driver logs
in the yarn app container directory.
My restrictions are:
1. want to use yarn-client mode only
2. want to enable logging in yarn container only so that it is aggregated
and backed up by yarn every hour to hdfs/s3

*How can I work around this to enable driver log rolling and
aggregation as well?*

Any pointers will be helpful.
thanks in advance.

-- 
Chandan Prakash


Spark and logging

2015-05-27 Thread dgoldenberg
I'm wondering how logging works in Spark.

I see that there's the log4j.properties.template file in the conf directory. 
Is it safe to assume Spark is using log4j 1?  What's the approach if we're using
log4j 2?  I've got a log4j2.xml file in the job jar which seems to be
working for my log statements, but Spark's logging seems to be taking its own
default route despite my setting Spark's log level to 'warn' only.
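
For concreteness, if passing the config explicitly at submit time is the right
approach, I imagine it would look something like this (whether
-Dlog4j.configurationFile actually takes precedence over Spark's bundled log4j 1
setup is exactly what I'm unsure about; file names are placeholders):

  spark-submit \
    --files log4j2.xml \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
    my-job.jar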

More interestingly, what happens if file-based loggers are at play?

If a log statement is in the driver program, I assume it'll get logged into a
log file that's co-located with the driver. What about log statements in the
partition processing functions?  Will their log statements get logged into a
file residing on a given 'slave' machine, or will Spark capture this log
output and divert it into the log file of the driver's machine?

Thanks.







Re: Spark and logging

2015-05-27 Thread Imran Rashid
only an answer to one of your questions:


 What about log statements in the partition processing functions?  Will their
 log statements get logged into a file residing on a given 'slave' machine, or
 will Spark capture this log output and divert it into the log file of the
 driver's machine?


They get logged to files on the remote nodes.  You can view the logs for
each executor through the UI.  If you are using Spark on YARN, you can grab
all the logs with the yarn logs command.
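
For example, assuming Spark on YARN (the application id below is just a placeholder):

  yarn logs -applicationId application_1432748700000_0001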