Interesting. I noticed that my driver log messages include a time stamp and function name, but no line number. However, log messages in my other python files only contain the message. All of my python code is in a single zip file. The zip file is a job submit argument.
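For what it's worth, the fields log4j emits are controlled by the appender's PatternLayout conversion pattern. A minimal sketch of a log4j.properties (appender name and pattern are illustrative, not taken from this thread; the actual file location and property names on your cluster may differ):

```properties
# Sketch: %d = date, %p = level, %F = file, %L = line, %m = message
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %F:%L - %m%n
```

One caveat: %F and %L report the JVM call site. For a logger invoked from Python through Py4J, log4j cannot see the Python frame, which is likely why the location renders as "?" rather than your .py file and line.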
2022-01-21 19:45:02 WARN __main__:? - sparkConfig: ('spark.sql.cbo.enabled', 'true')
2022-01-21 19:48:34 WARN __main__:? - readsSparkDF.rdd.getNumPartitions():1698
__init__ BEGIN
__init__ END
run BEGIN
run rawCountsSparkDF numRows:5387495 numCols:10409

My guess is that somehow I need to change the way log4j is configured on the workers?

Kind regards

Andy

From: Andrew Davidson <aedav...@ucsc.edu>
Date: Thursday, January 20, 2022 at 2:32 PM
To: "user @spark" <user@spark.apache.org>
Subject: How to configure log4j in pyspark to get log level, file name, and line number

Hi

When I use python logging for my unit tests, I am able to control the output format. I get the log level, the file name and line number, then the message:

[INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN

In my Spark driver I am able to get the log4j logger:

    spark = SparkSession\
        .builder\
        .appName("estimatedScalingFactors")\
        .getOrCreate()

    #
    # https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
    # initialize logger for yarn cluster logs
    #
    log4jLogger = spark.sparkContext._jvm.org.apache.log4j
    logger = log4jLogger.LogManager.getLogger(__name__)

However, it only outputs the message. As a hack I have been adding the function names to the message myself. I wonder if this is because of the way I make my python code available.
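Since log4j's %F/%L only see the JVM call site, one workaround on the driver is to use Python's own logging module, whose formatter does know the Python file, line, and function. A minimal sketch (the helper name and format string are my choice, not from this thread):

```python
import logging
import sys

def get_logger(name):
    """Return a logger formatted like [LEVEL file.py:line - func()] message."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on re-import
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(
            "[%(levelname)s %(filename)s:%(lineno)d - %(funcName)s()] %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.WARNING)
    return logger

logger = get_logger(__name__)
logger.warning("rowSums BEGIN")  # prints level, Python file name, and line number
```

These records go to the driver's stderr, so on Dataproc/YARN they end up in the driver output rather than the executors' log4j files.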
When I submit my job using '$ gcloud dataproc jobs submit pyspark' I pass my python files in a zip file:

    --py-files ${extraPkg}

I use level warn because the driver info logs are very verbose.

    ###############################################################################
    # needs: from functools import reduce; from operator import add
    # and:   from pyspark.sql.functions import col
    def rowSums( self, countsSparkDF, columnNames ):
        self.logger.warn( "rowSums BEGIN" )

        # https://stackoverflow.com/a/54283997/4586180
        retDF = countsSparkDF.na.fill( 0 ).withColumn( "rowSum",
                    reduce( add, [col( x ) for x in columnNames] ) )

        self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
                          .format( retDF.count(), len( retDF.columns ) ) )
        self.logger.warn( "rowSums END\n" )
        return retDF

kind regards

Andy
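As an aside, the reduce( add, [col( x ) for x in columnNames] ) idiom above folds the column list into a single sum expression. The same folding pattern in plain Python, without a cluster (the row data here is made up for illustration):

```python
from functools import reduce
from operator import add

# columnNames mirrors the Spark version; each dict stands in for one row.
columnNames = ["c1", "c2", "c3"]
rows = [
    {"c1": 1, "c2": 2, "c3": 3},
    {"c1": 4, "c2": None, "c3": 6},  # None plays the role of a null count
]

def row_sum(row):
    # na.fill(0) analogue: treat missing values as 0 before summing
    return reduce(add, [row.get(name) or 0 for name in columnNames])

print([row_sum(r) for r in rows])  # [6, 10]
```

In Spark the reduce runs at plan-building time on Column objects, producing one expression `col(c1) + col(c2) + ...` that the executors evaluate per row.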