Hi all,

This looks like a performance issue in the Livy server. I assume you are
using a recent version of Livy.

If that is the case, could you profile the Livy server to pinpoint where
the problem is?

Thanks,
Marco

On Mon, 8 Jul 2019, 21:03 Kadu Vido, <carlos.v...@lendico.com.br> wrote:

> Hi, I'm working with Hugo on the same project.
>
> Shubham, we're using almost the same setup; the only difference is Airflow
> 1.10.1. I coded a workaround into our Livy hook: it has a parameter for
> retries, and whenever the session returns anything other than 'idle', we
> try again before failing the task. It's not ideal, but at least our
> pipelines aren't stuck anymore.
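>
> For reference, the retry logic boils down to roughly the sketch below. It
> talks to the Livy REST API directly rather than going through our actual
> hook, and the names and defaults are illustrative only:
>
>     import time
>     import requests
>
>     LIVY_URL = "http://<emr-master>:8998"  # placeholder for our endpoint
>
>     def create_session_with_retries(retries=3, poll_interval=10, timeout=120):
>         """Create a Livy session, retrying whenever it fails to reach 'idle'."""
>         for _ in range(retries):
>             resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"})
>             resp.raise_for_status()
>             session_id = resp.json()["id"]
>             deadline = time.time() + timeout
>             while time.time() < deadline:
>                 r = requests.get(f"{LIVY_URL}/sessions/{session_id}")
>                 state = r.json()["state"]
>                 if state == "idle":
>                     return session_id
>                 if state in ("dead", "killed", "error"):
>                     break  # this attempt is lost, start over with a fresh session
>                 time.sleep(poll_interval)
>             # clean up the failed session before the next attempt
>             requests.delete(f"{LIVY_URL}/sessions/{session_id}")
>         raise RuntimeError("Livy session never reached 'idle'")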
>
> Zhang, I don't have the YARN logs at hand, but I can dig them up if you'd
> like to take a look. However, our latest clues point in a different
> direction:
>
> 1. Running *top* on the master node, we observed that Livy rapidly takes
> all the available CPUs after we send just a few requests (3 or 4 already
> cause this to happen; if we send upwards of 10, it crashes the service).
>
> 2. We can get around this by spacing them out a bit -- that is, if we use a
> loop to open the sessions and wait ~10s between them, Livy has enough time
> to release the CPU resources before we try to open a new one (see the
> sketch after this list). We've had help from some AWS engineers who tried
> several instance sizes and found that on larger instances they can open 10
> or 12 simultaneously, but:
>
> 3. Regardless of the size of the cluster, we cannot hold more than 9
> simultaneous sessions open. It doesn't matter if our cluster has enough
> vCPUs or RAM to handle more, and the size of the master node doesn't matter
> either: from the 10th session onwards, each one seems to either die or drop.
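>
> The spacing workaround from point 2 is nothing fancier than a loop like
> the one below (a sketch against the plain Livy REST API; the endpoint and
> the session count are placeholders):
>
>     import time
>     import requests
>
>     LIVY_URL = "http://<emr-master>:8998"  # placeholder
>
>     session_ids = []
>     for _ in range(7):
>         resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"})
>         resp.raise_for_status()
>         session_ids.append(resp.json()["id"])
>         time.sleep(10)  # give Livy time to release CPU before the next create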
>
> *Carlos Vido *
>
> Data Engineer @ Lendico Brasil <https://www.lendico.com.br>
>
>
> On Sat, 6 Jul 2019 at 13:30, Shubham Gupta <y2k.shubhamgu...@gmail.com>
> wrote:
>
>> I'm facing precisely the same issue.
>> .
>> I've written a LivySessionHook that's just a wrapper over PyLivy Session
>> <https://pylivy.readthedocs.io/en/latest/api/session.html>.
>>
>>    - I'm able to use this hook to send code snippets to a remote EMR
>>    cluster from a Python shell a few times, after which it starts throwing
>>    "caught exception 500 Server Error: Internal Server Error for url" (and
>>    continues to do so for the next hour or so).
>>    - However, when the same hook is triggered via an Airflow operator, I
>>    get absolutely no success (it always results in a 500 error); see the
>>    sketch below this list.
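>>
>> A rough equivalent of what the hook sends, using the raw REST API instead
>> of the actual pylivy code (URL and session id are placeholders):
>>
>>     import requests
>>
>>     LIVY_URL = "http://<emr-master>:8998"  # placeholder
>>
>>     def run_snippet(session_id, code):
>>         """Submit a code snippet to an already-open Livy session."""
>>         resp = requests.post(
>>             f"{LIVY_URL}/sessions/{session_id}/statements", json={"code": code}
>>         )
>>         resp.raise_for_status()  # this is where the 500 surfaces
>>         return resp.json()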
>>
>> .
>> I'm using
>>
>>    - Airflow 1.10.3
>>    - Python 3.7.3
>>    - EMR 5.24.1
>>    - Livy 0.6.0
>>    - Spark 2.4.2
>>
>>
>> *Shubham Gupta*
>> Software Engineer
>>  zomato
>>
>>
>> On Sat, Jul 6, 2019 at 6:56 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> For the dead/killed sessions, could you check the YARN app logs?
>>>
>>> On Thu, 4 Jul 2019 at 21:41, Hugo Herlanin <hugo.herla...@lendico.com.br> wrote:
>>>
>>>>
>>>> Hey, the user mailing list is not working out for me!
>>>>
>>>> I'm having some problems with my Livy setup. My use case is as follows: I
>>>> use a DAG in Airflow (1.10) to create an EMR cluster (5.24.1, one m4.large
>>>> master and two m5a.xlarge nodes), and when it is ready, the DAG sends 5 to
>>>> 7 simultaneous requests to Livy. I don't think I'm messing with the Livy
>>>> settings; I just set livy.spark.deploy-mode = client and
>>>> livy.repl.enable-hive-context = true.
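>>>>
>>>> To make the request pattern concrete, the DAG effectively does the
>>>> following (a simplified sketch against the Livy REST API; the real code
>>>> goes through an Airflow hook and the endpoint is a placeholder):
>>>>
>>>>     import requests
>>>>
>>>>     LIVY_URL = "http://<emr-master>:8998"  # placeholder
>>>>
>>>>     # 5 to 7 session requests fired back to back, with no delay between
>>>>     # them; the two livy.* settings above live in livy.conf on the
>>>>     # cluster, not in the request payload
>>>>     session_ids = []
>>>>     for _ in range(7):
>>>>         resp = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"})
>>>>         if resp.ok:
>>>>             session_ids.append(resp.json()["id"])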
>>>>
>>>> The problem is that of these ~5 to 7 sessions, just one or two open (go
>>>> to 'idle') and all the others go straight to 'dead' or 'killed'; the YARN
>>>> logs show that the sessions were killed by the 'livy' user. I tried
>>>> tinkering with every possible timeout setting, but this still happens. If
>>>> I send more than ~10 simultaneous requests, Livy responds with 500, and if
>>>> I keep sending requests, the server freezes. This happens even when EMR
>>>> has enough resources available.
>>>>
>>>> I know the cluster is able to handle that many sessions, because it works
>>>> when I open them in a loop with an interval of 15 seconds or more, but it
>>>> feels like Livy should be able to deal with that many requests
>>>> simultaneously. It seems strange that I should have to manage the queue
>>>> this way for the API of a distributed system.
>>>>
>>>> Do you have any clue about what I might be doing wrong? Is there any
>>>> known limitation that I'm unaware of?
>>>>
>>>> Best,
>>>>
>>>> Hugo Herlanin
>>>>
>>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
