Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Pietro Pugni Mon, 08 May 2017 16:30:07 -0700

I am reposting this because it didn’t appear on the mailing list board.

These are the steps needed to reproduce the error and track down the log 
message.


1) I started a brand-new instance of Zeppelin by issuing:
service zeppelin start

and started a bash script that tracks R process activity.
After I ran a simple R script from Zeppelin, the R interpreter process was 
started:

Mon May  8 11:27:59 CEST 2017 >>> R started

2) I left the browser open until 12:26:15, when I closed it. Zeppelin logged 
the closed connection:
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} 
NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. 
(1001) null

3) At 13:08:00 the R process stopped. My script reported:
Mon May  8 13:08:00 CEST 2017 >>> R stopped
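
The tracking script mentioned in step 1 is not included in this message. A 
minimal sketch of how such a watcher could work is shown here; the function 
names, the one-second polling interval, and matching the process by the exact 
name "R" with pgrep are all assumptions, not the original script:

```bash
#!/usr/bin/env bash
# Sketch of a watcher that logs when an R process appears or disappears.
# Assumptions (not from the original post): the process is matched by the
# exact name "R" via pgrep, and polling once per second is precise enough.

# Print a log line only when the state changes; print nothing otherwise.
# $1 = previous state ("up" or "down"), $2 = current state.
report_transition() {
  if [ "$1" = "down" ] && [ "$2" = "up" ]; then
    echo "$(date) >>> R started"
  elif [ "$1" = "up" ] && [ "$2" = "down" ]; then
    echo "$(date) >>> R stopped"
  fi
}

# Print "up" if at least one process named exactly "R" exists, else "down".
r_state() {
  if pgrep -x R > /dev/null 2>&1; then echo "up"; else echo "down"; fi
}

# Poll forever, reporting every transition.
watch_r() {
  prev="down"
  while true; do
    curr="$(r_state)"
    report_transition "$prev" "$curr"
    prev="$curr"
    sleep 1
  done
}

# Start polling only when explicitly asked, e.g.: ./watch_r.sh --watch
if [ "${1:-}" = "--watch" ]; then
  watch_r
fi
```

Polling keeps the sketch simple, at the cost of up to one second of 
imprecision in the reported start and stop times.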

This is the output from the interpreter log file (irrelevant lines removed):
INFO [2017-05-08 11:27:43,632] ({Thread-0} 
RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on 
port 45227
INFO [2017-05-08 11:27:44,600] ({pool-1-thread-3} 
RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter 
org.apache.zeppelin.spark.SparkInterpreter
INFO [2017-05-08 11:27:44,624] ({pool-1-thread-3} 
RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter 
org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2017-05-08 11:27:44,629] ({pool-1-thread-3} 
RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter 
org.apache.zeppelin.spark.DepInterpreter
INFO [2017-05-08 11:27:44,640] ({pool-1-thread-3} 
RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter 
org.apache.zeppelin.spark.PySparkInterpreter
INFO [2017-05-08 11:27:44,643] ({pool-1-thread-3} 
RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter 
org.apache.zeppelin.spark.SparkRInterpreter
...
INFO [2017-05-08 11:28:00,188] ({pool-2-thread-2} 
SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1494235664723 
finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter2097894179
DEBUG [2017-05-08 11:28:00,819] ({pool-1-thread-3} 
RemoteInterpreterServer.java[resourcePoolGetAll]:911) - Request getAll from 
ZeppelinServer
DEBUG [2017-05-08 13:08:00,187] ({Exec Stream Pumper} 
InterpreterOutputStream.java[processLine]:72) - Interpreter output:Error in 
handleErrors(returnStatus, conn) : 
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} 
InterpreterOutputStream.java[processLine]:72) - Interpreter output:  No status 
is returned. Java SparkR backend might have failed.
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} 
InterpreterOutputStream.java[processLine]:72) - Interpreter output:Calls: 
<Anonymous> -> invokeJava -> handleErrors
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} 
InterpreterOutputStream.java[processLine]:72) - Interpreter output:Execution 
halted
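
As a quick check on the timestamps in the log above: the last job finished at 
11:28:00 and the first SparkR error line appeared at 13:08:00. A small sketch 
for computing the gap follows; the helper names are hypothetical, and it 
assumes both timestamps fall on the same day and ignores the millisecond part:

```bash
#!/usr/bin/env bash
# Sketch: seconds elapsed between two same-day HH:MM:SS timestamps.
# to_seconds and elapsed are hypothetical helper names.

# Convert HH:MM:SS to seconds since midnight.
# awk treats fields such as "08" as decimal numbers, so there is no
# octal-parsing problem as there would be with shell arithmetic.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Difference in seconds between $1 (earlier) and $2 (later).
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Last finished job (11:28:00) to the first SparkR error line (13:08:00).
elapsed 11:28:00 13:08:00
```

For these two timestamps the result is 6000 seconds, i.e. 100 minutes. A gap 
this round may point to a fixed timeout rather than a random failure, but that 
is only a guess based on this single run.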

This is the output from the Zeppelin log file (it did not record the R 
interpreter failure):
INFO [2017-05-08 11:28:00,221] ({pool-2-thread-2} 
NotebookServer.java[afterStatusChange]:2056) - Job 20170506-145151_1585482989 
is finished successfully, status: FINISHED
INFO [2017-05-08 11:28:00,675] ({pool-2-thread-2} 
SchedulerFactory.java[jobFinished]:137) - Job 
paragraph_1494075111996_-1250116940 finished by scheduler 
org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session2130846287
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} 
NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. 
(1001) null
INFO [2017-05-08 12:27:12,126] ({Thread-33} 
AbstractValidatingSessionManager.java[validateSessions]:271) - Validating all 
active sessions...
INFO [2017-05-08 12:27:12,126] ({Thread-33} 
AbstractValidatingSessionManager.java[validateSessions]:304) - Finished session 
validation.  No sessions were stopped.

Hope this helps. 
Any hints?

>> On 08 May 2017, at 11:08, Pietro Pugni <pietro.pu...@gmail.com 
>> <mailto:pietro.pu...@gmail.com>> wrote:
>> 
>> I know for sure that the R process gets killed (or quits), but I don't know 
>> whether its parent process (interpreter.sh) gets killed too.
>> 
>> I noticed that I can always restart the interpreter on 0.7.1, while on 0.7.0 
>> this was sometimes impossible (I had to restart the zeppelin service 
>> manually). That JIRA probably improved the situation a little.
>> 
>> I am now running a bash script that tracks the start and stop times of the R 
>> process in order to shed some light on this issue. I also enabled DEBUG 
>> logging in the log4j properties file.
>> 
>> 
>> On 6 May 2017 4:43 PM, "Paul Brenner" <pbren...@placeiq.com 
>> <mailto:pbren...@placeiq.com>> wrote:
>> 
>> Great work documenting repeatable steps for this hard-to-nail-down problem. 
>> I see similar problems running the Spark (Scala) interpreter but haven’t 
>> been as systematic about hunting down the issue as you. 
>> 
>> I do wonder if this is related somehow to 
>> https://issues.apache.org/jira/browse/ZEPPELIN-1832 
>> which just seems to have addressed killing off zombie processes but I’m not 
>> sure it covered where zombie processes are coming from. Perhaps we need to 
>> open a ticket for this?
>> 
>> In the meantime, if you can’t restart Zeppelin every time you run into this 
>> problem, you can probably just kill the interpreter process. I find myself 
>> doing that multiple times in a normal work day.
>> 
>> Paul Brenner
>> DATA SCIENTIST
>> (217) 390-3033  
>> 
>> On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pietro.pu...@gmail.com> wrote:
>> Hi all,
>> I am facing a strange issue on two different machines that act as servers. 
>> Each of them runs an instance of Zeppelin installed as a systemd service.
>> The configuration is:
>>  - Ubuntu Server 16.04.2 LTS
>>  - Spark 2.1.0
>>  - Microsoft R Open 3.3.2
>>  - Zeppelin 0.7.1 (0.7.0 gave the same problems)
>> 
>> zeppelin-env.sh has the following settings:
>> export SPARK_HOME="/spark/home/directory"
>> 
>> spark-env.sh has the following settings:
>> export LANG="en_US"
>> export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir 
>> -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
>> export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"
>> 
>> spark-defaults.conf is set as:
>> spark.executor.memory                21g
>> spark.driver.memory                     21g
>> spark.python.worker.memory           4g
>> spark.sql.autoBroadcastJoinThreshold    0
>> 
>> I use Spark in stand-alone mode and it works perfectly. It also works 
>> correctly with Zeppelin at first, but this is what happens:
>> 1) Start zeppelin on the server using the command service zeppelin start
>> 2) Connect to port 8080 using Mozilla Firefox from client 
>> 3) Insert username and password (I enabled Shiro authentication)
>> 4) Open a notebook
>> 5) Execute the following code:
>> %spark.r
>> 2+2
>> 6) The code runs correctly and I can see R running as a process.
>> 7) Repeat steps 2-5 after some time (say 2 or 3 hours): Zeppelin stays on 
>> “Running” forever or, if more time has elapsed since the last run (for 
>> example 1 day), it returns “Error”. The time before it becomes unresponsive 
>> seems random and unpredictable. Also, R is no longer in the list of running 
>> processes. The Spark session remains active: I can still access the Spark UI 
>> on port 4040 and the application name is “Zeppelin”, so it is the Spark 
>> instance created by Zeppelin.
>> 
>> I observed that sometimes I can simply restart the interpreter from the 
>> Zeppelin UI, but many other times that doesn’t work and I have to restart 
>> Zeppelin (service zeppelin restart).
>> 
>> This issue affects both 0.7.0 and 0.7.1; I haven’t tried earlier versions. 
>> It also happens when Zeppelin isn’t installed as a service.
>> 
>> I can’t provide more detail because I can’t see any error or warning in the 
>> logs. This is really strange.
>> 
>> Thank you all.
>> Kind regards
>>  Pietro Pugni
>> 
>
