I am reposting this because it didn’t appear on the mailing list board. These are the steps needed to reproduce the error and to track down the log message.
1) I started a brand new instance of Zeppelin by issuing:
service zeppelin start
and started a bash script that tracks R process activity. After I ran a simple R script from Zeppelin, the R interpreter process started:
Mon May 8 11:27:59 CEST 2017 >>> R started

2) I left the browser open and closed it at 12:26:15. Zeppelin logged the connection being closed:
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null

3) At 13:08:00 R stopped. My script returned:
Mon May 8 13:08:00 CEST 2017 >>> R stopped

This is the output from the interpreter log file (irrelevant lines removed):
INFO [2017-05-08 11:27:43,632] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 45227
INFO [2017-05-08 11:27:44,600] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkInterpreter
INFO [2017-05-08 11:27:44,624] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2017-05-08 11:27:44,629] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.DepInterpreter
INFO [2017-05-08 11:27:44,640] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.PySparkInterpreter
INFO [2017-05-08 11:27:44,643] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkRInterpreter
...
INFO [2017-05-08 11:28:00,188] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1494235664723 finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter2097894179
DEBUG [2017-05-08 11:28:00,819] ({pool-1-thread-3} RemoteInterpreterServer.java[resourcePoolGetAll]:911) - Request getAll from ZeppelinServer
DEBUG [2017-05-08 13:08:00,187] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Error in handleErrors(returnStatus, conn) :
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output: No status is returned. Java SparkR backend might have failed.
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Calls: <Anonymous> -> invokeJava -> handleErrors
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Execution halted

This is the output from the Zeppelin log file (it did not record the R interpreter failure):
INFO [2017-05-08 11:28:00,221] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2056) - Job 20170506-145151_1585482989 is finished successfully, status: FINISHED
INFO [2017-05-08 11:28:00,675] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job paragraph_1494075111996_-1250116940 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session2130846287
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null
INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:271) - Validating all active sessions...
INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:304) - Finished session validation. No sessions were stopped.

Hope this helps. Any hints?

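To put numbers on the timeline in the log excerpts above, here is a minimal Python sketch that computes the gaps between the logged events. It is not part of the original test setup: the helper names `parse_log_time` and `seconds_between` are made up for illustration, and the three timestamps are copied from the log lines above.

```python
from datetime import datetime

# Timestamp layout used in the Zeppelin logs, e.g. "2017-05-08 11:28:00,188".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp: str) -> datetime:
    """Parse a timestamp copied from a Zeppelin log line (hypothetical helper)."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from one log timestamp to another."""
    return (parse_log_time(end) - parse_log_time(start)).total_seconds()


# Timestamps taken verbatim from the log excerpts above.
last_job_finished = "2017-05-08 11:28:00,188"  # SchedulerFactory jobFinished
browser_closed = "2017-05-08 12:26:15,879"     # NotebookServer onClose
r_error_logged = "2017-05-08 13:08:00,187"     # first SparkR error line

idle_before_failure = seconds_between(last_job_finished, r_error_logged)
after_browser_close = seconds_between(browser_closed, r_error_logged)

print(f"last job finished -> R error: {idle_before_failure:.3f} s")
print(f"browser closed    -> R error: {after_browser_close:.3f} s")
```

With these inputs the first gap is 5999.999 s, so in this run the SparkR error was logged almost exactly 100 minutes after the last job finished, and about 41 minutes 44 seconds after the browser was closed.
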
>> On 08 May 2017, at 11:08, Pietro Pugni <pietro.pu...@gmail.com> wrote:
>>
>> I know for sure that the R process gets killed (or quits), but I don't know whether its parent process (interpreter.sh) gets killed too.
>>
>> I noticed that I can always restart the interpreter on 0.7.1, while on 0.7.0 it was sometimes impossible (I had to manually restart the zeppelin service). That JIRA probably improved the situation a little.
>>
>> I'm now running a bash script that tracks the start and stop times of the R process in order to shed some light on this issue. I enabled DEBUG logging in the log4j properties file.
>>
>>
>> On 6 May 2017 4:43 PM, "Paul Brenner" <pbren...@placeiq.com> wrote:
>>
>> Great work documenting repeatable steps for this hard-to-nail-down problem. I see similar problems running the spark (scala) interpreter but haven’t been as systematic about hunting down the issue as you.
>>
>> I do wonder if this is related somehow to https://issues.apache.org/jira/browse/ZEPPELIN-1832 which seems to have addressed killing off zombie processes, but I’m not sure it covered where the zombie processes are coming from. Perhaps we need to open a ticket for this?
>>
>> In the meantime, if you don’t have the ability to restart zeppelin every time you run into this problem, you can probably just kill the interpreter process. I find myself doing that multiple times in a normal work day.
>>
>> Paul Brenner
>> DATA SCIENTIST
>>
>> On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pietro.pu...@gmail.com> wrote:
>> Hi all,
>> I am facing a strange issue on two different machines that act as servers. Each of them runs an instance of Zeppelin installed as a systemd service.
>> The configuration is:
>> - Ubuntu Server 16.04.2 LTS
>> - Spark 2.1.0
>> - Microsoft R Open 3.3.2
>> - Zeppelin 0.7.1 (0.7.0 gave the same problems)
>>
>> zeppelin-env.sh has the following settings:
>> export SPARK_HOME="/spark/home/directory"
>>
>> spark-env.sh has the following settings:
>> export LANG="en_US"
>> export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
>> export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"
>>
>> spark-defaults.conf is set as:
>> spark.executor.memory 21g
>> spark.driver.memory 21g
>> spark.python.worker.memory 4g
>> spark.sql.autoBroadcastJoinThreshold 0
>>
>> I use Spark in stand-alone mode and it works perfectly. It also works correctly with Zeppelin, but this is what happens:
>> 1) Start zeppelin on the server using the command service zeppelin start
>> 2) Connect to port 8080 using Mozilla Firefox from the client
>> 3) Insert username and password (I enabled Shiro authentication)
>> 4) Open a notebook
>> 5) Execute the following code:
>> %spark.r
>> 2+2
>> 6) The code runs correctly and I can see that R is currently running as a process.
>> 7) Repeat steps 2-5 after some time (let’s say 2 or 3 hours) and Zeppelin stays on “Running” forever or, if more time has elapsed since the last run (for example 1 day), it returns “Error”. The time it takes to become unresponsive seems to be random and unpredictable. Also, R is no longer present in the list of running processes. The Spark session remains active because I can access the Spark UI on port 4040 and the application name is “Zeppelin”, so it’s the Spark instance created by Zeppelin.
>>
>> I observed that sometimes I can simply restart the interpreter from the Zeppelin UI, but many other times that doesn’t work and I have to restart Zeppelin ( service zeppelin restart ).
>>
>> This issue affects both 0.7.0 and 0.7.1, but I haven’t tried previous versions. It also happens if Zeppelin isn’t installed as a service.
>>
>> I can’t provide more detail because I can’t see any error or warning in the logs. This is really strange.
>>
>> Thank you all.
>> Kind regards
>> Pietro Pugni
>>
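
For anyone who wants to reproduce the start/stop tracking described in this thread, here is a minimal sketch of such a watcher. The original was a bash script that is not shown, so this Python stand-in is only an illustration under stated assumptions: the function names are hypothetical, `pgrep` is assumed to be installed, and the R process is assumed to be named exactly `R`.

```python
import subprocess
import time
from datetime import datetime
from typing import Optional


def r_is_running(process_name: str = "R") -> bool:
    """Return True if pgrep finds a process named exactly process_name.

    Assumption: the interpreter's R process shows up under the name "R".
    """
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transition(was_running: bool, is_running: bool) -> Optional[str]:
    """Describe a change of state, or return None if nothing changed."""
    if is_running and not was_running:
        return "R started"
    if was_running and not is_running:
        return "R stopped"
    return None


def watch(poll_seconds: float = 1.0, max_polls: Optional[int] = None) -> None:
    """Poll for the R process and print a timestamped line on each change."""
    was_running = False
    polls = 0
    while max_polls is None or polls < max_polls:
        now_running = r_is_running()
        event = transition(was_running, now_running)
        if event is not None:
            stamp = datetime.now().strftime("%a %b %d %H:%M:%S %Y")
            print(f"{stamp} >>> {event}", flush=True)
        was_running = now_running
        polls += 1
        time.sleep(poll_seconds)
```

Calling `watch()` with no arguments polls once per second until interrupted. Redirecting its output to a file gives start and stop times that can be compared with the Zeppelin and interpreter logs.
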