[ https://issues.apache.org/jira/browse/SPARK-21094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
holdenk resolved SPARK-21094.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

> Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway
> ----------------------------------------------------------------
>
>                 Key: SPARK-21094
>                 URL: https://issues.apache.org/jira/browse/SPARK-21094
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.1.1
>            Reporter: Peter Parente
>            Assignee: Peter Parente
>            Priority: Major
>             Fix For: 3.0.0
>
>
> The Popen call to launch the py4j gateway specifies no stdout and stderr
> options, meaning logging from the JVM always goes to the parent process
> terminal.
> https://github.com/apache/spark/blob/v2.1.1/python/pyspark/java_gateway.py#L77
> It would be super handy if the launch_gateway function took an additional
> dict parameter called popen_kwargs and passed it along to the Popen calls.
> This API enhancement, for example, would allow Python applications to capture
> all stdout and stderr coming from Spark and process it programmatically,
> without resorting to reading from log files or other hijinks.
> Example use:
> {code}
> import pyspark
> import subprocess
> from pyspark.java_gateway import launch_gateway
> # Make the py4j JVM stdout and stderr available without buffering
> popen_kwargs = {
>     'stdout': subprocess.PIPE,
>     'stderr': subprocess.PIPE,
>     'bufsize': 0
> }
> # Launch the gateway with our custom settings
> gateway = launch_gateway(popen_kwargs=popen_kwargs)
> # Use the gateway we launched
> sc = pyspark.SparkContext(gateway=gateway)
> # This could be done in a thread or event loop or ...
> # Written briefly / poorly here only as a demo
> # readline() returns b'' at EOF, which ends the loop
> for line in iter(gateway.proc.stdout.readline, b''):
>     print(line.decode('utf-8'), end='')
> {code}
> To get access to the stdout and stderr pipes, the "proc" instance created in
> launch_gateway also needs to be exposed to the application.
> I'm thinking that stashing it on the JavaGateway instance that the function
> already returns is the cleanest option from the client's perspective, but it
> means hanging an extra attribute off the py4j.JavaGateway object.
> I can submit a PR with this addition for further discussion.
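
The issue's example notes that reading the pipes "could be done in a thread or event loop". A minimal sketch of the threaded variant follows. It is not taken from the issue or from Spark: `drain_pipe`, `start_drain_threads`, and `demo` are hypothetical helper names, and a short-lived Python child process stands in for the JVM so the sketch runs without Spark installed. With the proposed API, the process object would presumably be `gateway.proc`, assuming that attribute is exposed as described above.

```python
import queue
import subprocess
import sys
import threading


def drain_pipe(pipe, sink, label):
    """Read lines from a binary pipe until EOF and put (label, text) on sink."""
    # readline() returns b"" at EOF, which ends the iteration.
    for raw in iter(pipe.readline, b""):
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        sink.put((label, text))
    pipe.close()


def start_drain_threads(proc, sink):
    """Start one daemon thread per pipe so a full pipe cannot stall the child."""
    threads = []
    for label, pipe in (("stdout", proc.stdout), ("stderr", proc.stderr)):
        t = threading.Thread(
            target=drain_pipe, args=(pipe, sink, label), daemon=True
        )
        t.start()
        threads.append(t)
    return threads


def demo():
    """Capture both streams of a stand-in child process (hypothetical demo)."""
    # Assumption: this child replaces the py4j JVM purely for illustration.
    child_code = (
        "import sys; print('out line'); print('err line', file=sys.stderr)"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        bufsize=0,
    )
    sink = queue.Queue()
    threads = start_drain_threads(proc, sink)
    proc.wait()
    for t in threads:
        t.join()
    captured = []
    while not sink.empty():
        captured.append(sink.get())
    return captured
```

Draining stdout and stderr on separate threads avoids the case where the child blocks on one full pipe while the parent is waiting on the other. The relative order of stdout and stderr entries in the queue is not guaranteed.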