I have a Python script that submits Spark jobs using the spark-submit tool. I want to execute the command and write its output both to STDOUT and to a logfile in real time. I'm using Python 2.7 on an Ubuntu server.
This is what I have so far in my SubmitJob.py script:

    #!/usr/bin/python
    import subprocess

    # Submit the command
    def submitJob(cmd, log_file):
        with open(log_file, 'w') as fh:
            process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
            while True:
                output = process.stdout.readline()
                if output == '' and process.poll() is not None:
                    break
                if output:
                    print output.strip()
                    fh.write(output)
            rc = process.poll()
        return rc

    if __name__ == "__main__":
        cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077",
                   "--class", "com.spark.myapp", "./myapp.jar"]
        log_file = "/tmp/out.log"
        exit_status = submitJob(cmdList, log_file)
        print "job finished with status ", exit_status

The strange thing is, when I execute the same command directly in the shell, it works fine and produces output on screen as the program proceeds. So it looks like something is wrong in the way I'm using subprocess.PIPE for stdout and writing to the file.

What's the currently recommended way to use the subprocess module to write to stdout and a log file in real time, line by line? I see a lot of different options on the internet but I'm not sure which is correct or up to date. Is there anything specific about the way spark-submit buffers its stdout that I need to take care of? Thanks.
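For reference, here is a variant I was considering based on examples I've found online. It's only a sketch: the assumptions are that setting bufsize=1 on the Popen call and prefixing the command with stdbuf -oL help when spark-submit block-buffers its output once stdout is a pipe rather than a terminal. I don't know whether this is the recommended pattern:

    #!/usr/bin/python
    import subprocess
    import sys

    def submitJob(cmd, log_file):
        # Assumption: stdbuf (coreutils) forces the child to line-buffer
        # stdout/stderr even when they are redirected to a pipe.
        cmd = ["stdbuf", "-oL", "-eL"] + cmd
        with open(log_file, 'w') as fh:
            process = subprocess.Popen(cmd,
                                       stdout=subprocess.PIPE,
                                       stderr=subprocess.STDOUT,
                                       bufsize=1)  # line-buffered pipe
            # Read until readline() returns '' at EOF.
            for line in iter(process.stdout.readline, ''):
                sys.stdout.write(line)
                sys.stdout.flush()   # show it on screen immediately
                fh.write(line)
                fh.flush()           # and in the log file as well
            process.stdout.close()
        return process.wait()

    if __name__ == "__main__":
        cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077",
                   "--class", "com.spark.myapp", "./myapp.jar"]
        print "job finished with status ", submitJob(cmdList, "/tmp/out.log")

My understanding is that many programs switch from line buffering to block buffering when stdout is a pipe instead of a terminal, which would explain why the output shows up live in the shell but not through subprocess.PIPE; the stdbuf prefix is an attempt to work around that, but I haven't confirmed it applies to spark-submit.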