Hi all,

Seems like a perf issue in the Livy server. I assume you are using a recent version of Livy. If this is the case, could you profile the Livy server to understand where the problem is?

Thanks,
Marco
On Mon, 8 Jul 2019, 21:03 Kadu Vido, <carlos.v...@lendico.com.br> wrote:

> Hi, I'm working with Hugo on the same project.
>
> Shubham, we're using almost the same setup; the only difference is Airflow 1.10.1. I coded a workaround in our Livy hook: it takes a retry count, and whenever the session returns anything other than 'idle' we try again before failing the task. It's not ideal, but at least our pipelines aren't stuck anymore.
>
> Zhang, I don't have the YARN logs at hand, but I can dig them up if you'd like to take a look. However, our latest clues point in a different direction:
>
> 1. Running *top* on the master node, we observed that Livy rapidly takes all the available CPUs after we send just a few requests (3 or 4 already cause this to happen; if we send upwards of 10, it crashes the service).
>
> 2. We can get around this by spacing the requests out a bit -- that is, if we use a loop to open the sessions and wait ~10s between them, Livy gets enough time to release the CPU before opening the next one. We've had help from some AWS engineers who tried several instance sizes and found that on larger instances they can open 10 or 12 simultaneously, but:
>
> 3. Regardless of the size of the cluster, we cannot hold more than 9 simultaneous sessions open. It doesn't matter whether the cluster has enough vCPUs or RAM to handle more, and the size of the master node doesn't matter either: from the 10th session onwards, each one seems to either die or drop.
>
> *Carlos Vido*
>
> Data Engineer @ Lendico Brasil <https://www.lendico.com.br>
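For reference, a minimal sketch of the retry workaround Kadu describes above, talking to the Livy REST API directly rather than reproducing the actual hook code. The /sessions endpoints and the 'idle'/'dead'/'killed' states are standard Livy; the URL, function name, and retry parameters here are illustrative assumptions:

    import time
    import requests

    # Placeholder for the Livy endpoint on the EMR master node.
    LIVY_URL = "http://<emr-master>:8998"

    def create_idle_session(retries=3, poll_interval=10, timeout=300):
        """Open a Livy session; retry if it ends up in any state other than 'idle'."""
        for attempt in range(1, retries + 1):
            resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"})
            resp.raise_for_status()
            session_id = resp.json()["id"]

            deadline = time.time() + timeout
            while time.time() < deadline:
                state = requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"]
                if state == "idle":
                    return session_id
                if state in ("dead", "killed", "error"):
                    break  # this session is gone; fall through and retry with a fresh one
                time.sleep(poll_interval)

            # Clean up the dead or stuck session before the next attempt.
            requests.delete(f"{LIVY_URL}/sessions/{session_id}")

        raise RuntimeError(f"No idle session after {retries} attempts")

A retry like this does not address the CPU spike on the master, but it keeps a pipeline from getting stuck on a single 'dead' session.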
> On Sat, 6 Jul 2019 at 13:30, Shubham Gupta <y2k.shubhamgu...@gmail.com> wrote:
>
>> I'm facing precisely the same issue.
>>
>> I've written a LivySessionHook that's just a wrapper over the PyLivy Session <https://pylivy.readthedocs.io/en/latest/api/session.html>.
>>
>> - I'm able to use this hook to send code snippets to the remote EMR cluster via a Python shell a few times, after which it starts throwing "caught exception 500 Server Error: Internal Server Error for url" (and continues to do so for the next hour or so).
>> - However, when the same hook is triggered via an Airflow operator, I get absolutely no success (it always results in a 500 error).
>>
>> I'm using:
>>
>> - Airflow 1.10.3
>> - Python 3.7.3
>> - EMR 5.24.1
>> - Livy 0.6.0
>> - Spark 2.4.2
>>
>> *Shubham Gupta*
>> Software Engineer
>> zomato
>>
>> On Sat, Jul 6, 2019 at 6:56 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> For the dead/killed sessions, could you check the YARN app logs?
>>>
>>> Hugo Herlanin <hugo.herla...@lendico.com.br> wrote on Thu, 4 Jul 2019 at 21:41:
>>>
>>>> Hey, the user mailing list is not working out!
>>>>
>>>> I am having some problems with my Livy setup. My use case is as follows: I use a DAG in Airflow (1.10) to create a cluster in EMR (5.24.1, one m4.large master and two m5a.xlarge nodes), and when it is ready, this DAG sends 5 to 7 simultaneous requests to Livy. I don't think I'm messing with the Livy settings; I only set livy.spark.deploy-mode = client and livy.repl.enable-hive-context = true.
>>>>
>>>> The problem is that of these ~5 to 7 sessions, just one or two open (go to 'idle') and all the others go straight to 'dead' or 'killed'; the YARN logs show that the sessions were killed by the 'livy' user. I tried to tinker with all the possible timeout settings, but this still happens. If I send more than ~10 simultaneous requests, Livy responds with 500, and if I keep sending requests, the server freezes. This happens even if EMR has enough resources available.
>>>>
>>>> I know the cluster is able to handle that many sessions because it works when I open them via a loop with an interval of 15 seconds or more, but it feels like Livy should be able to deal with that many requests simultaneously. It seems strange that I should need to manage the queue in such a way for the API of a distributed system.
>>>>
>>>> Do you have any clue about where I might be going wrong? Is there any known limitation that I'm unaware of?
>>>>
>>>> Best,
>>>>
>>>> Hugo Herlanin
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
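For completeness, the spacing workaround that both Kadu's item 2 and Hugo's loop describe could look roughly like the sketch below. The endpoint URL, session count, and interval are illustrative assumptions; the only Livy-specific part is the standard POST /sessions call:

    import time
    import requests

    # Placeholder for the Livy endpoint on the EMR master node.
    LIVY_URL = "http://<emr-master>:8998"
    NUM_SESSIONS = 7       # Hugo's DAG opens 5 to 7 sessions
    SPACING_SECONDS = 15   # an interval of ~15s reportedly lets every session reach 'idle'

    session_ids = []
    for _ in range(NUM_SESSIONS):
        resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"})
        resp.raise_for_status()
        session_ids.append(resp.json()["id"])
        # Give Livy time to release CPU on the master before the next request.
        time.sleep(SPACING_SECONDS)

    print("Requested sessions:", session_ids)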