Just to explain why I think threading is not the root cause: it should not
matter how many threads your processor has, and there is no direct
cause-and-effect link between decreasing the number of threads and logs
appearing. The worst that can happen when you do not have enough CPU power
to run many threads in parallel is that everything simply runs slower,
because the threads will periodically preempt each other.

The article you mention is about not having enough memory, not threads. So
maybe the root cause is that you simply use too much memory when you run
more threads, and that is when you hit the issue? Try increasing memory and
see if it still happens. Setting up monitoring on your machine to observe
resource usage should also help diagnose whether memory is the problem. The
hypothesis that memory is the root cause is quite plausible, but it is not
directly related to the number of threads/cores in your processor; it is
linked to the memory available (which might be correlated, as machines with
more threads often tend to have more memory).
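
For the monitoring part, something along these lines could work as a
starting point. This is a minimal sketch, not Airflow-specific; it assumes
psutil is installed (pip3 install psutil), and the 5-second interval is
arbitrary:

    import time

    import psutil

    # Print CPU and memory usage periodically; a near-100% memory reading
    # (or heavy swap use) at the moments the tasks fail would support the
    # "not enough memory" hypothesis.
    while True:
        mem = psutil.virtual_memory()
        cpu = psutil.cpu_percent(interval=1)
        print("cpu=%d%% mem_used=%d%% available=%d MiB"
              % (cpu, mem.percent, mem.available // (1024 * 1024)))
        time.sleep(5)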

J.

On Thu, Dec 19, 2019 at 9:05 PM Reed Villanueva <[email protected]>
wrote:

> I am using postgres as the backend (my settings related to this are at the
> bottom of this message).
> Also, there are no logs: not in the webserver UI, and not at the folder /
> path indicated in the task details for the tasks that failed without logs
> (i.e. those log_filepath values do not actually exist on the machine),
> which is why the problem seemed so odd to me.
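> (To double-check that, a quick sketch like the following can verify the
> paths; the list below is hypothetical and should be replaced with the
> log_filepath values copied from the failed tasks' details:)
>
>     import os
>
>     # Hypothetical: paste the log_filepath values from the task details.
>     paths = [
>         "/home/airflow/airflow/logs/mydag/task_level1_table1/...",
>     ]
>     for p in paths:
>         print(p, "exists" if os.path.exists(p) else "missing")
>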
> Could you explain a bit more about why you doubt that the threading issue
> was the problem?
> The docs here (
> https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_emitting_logs)
> are what initially made me think to take a second look at my machine specs
> vs airflow.cfg concurrency settings.
>
> postgresql settings (based on following this guide):
>
> [airflow@airflowetl airflow]$ rpm -q postgresql-server postgresql postgresql-devel
> postgresql-server-9.2.24-1.el7_5.x86_64
> postgresql-9.2.24-1.el7_5.x86_64
> postgresql-devel-9.2.24-1.el7_5.x86_64
>
>
> [airflow@airflowetl airflow]$ pip3 freeze | grep sqlalchemy
> marshmallow-sqlalchemy==0.19.0
> [airflow@airflowetl airflow]$ pip3 freeze | grep psycopg2
> psycopg2==2.8.4
>
>
>
> [airflow@airflowetl airflow]$ psql airflow
> psql (9.2.24)
> Type "help" for help.
>
> airflow=> \du
>                              List of roles
>  Role name |                   Attributes                   | Member of
> -----------+------------------------------------------------+-----------
>  airflow   |                                                | {}
>  postgres  | Superuser, Create role, Create DB, Replication | {}
>
> airflow-> \l
>                                   List of databases
>    Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
> -----------+----------+----------+-------------+-------------+-----------------------
>  airflow   | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres         +
>            |          |          |             |             | postgres=CTc/postgres+
>            |          |          |             |             | airflow=CTc/postgres
>  postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
>  template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
>            |          |          |             |             | postgres=CTc/postgres
>  template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
>            |          |          |             |             | postgres=CTc/postgres
>
> airflow=> \c airflow
> You are now connected to database "airflow" as user "airflow".
>
> airflow=> \dt
> No relations found.
>
> airflow=> \conninfo
> You are connected to database "airflow" as user "airflow" via socket in 
> "/var/run/postgresql" at port "5432".
>
>
>
> [root@airflowetl airflow]# cat /var/lib/pgsql/data/pg_hba.conf
> ....
> # TYPE  DATABASE        USER            ADDRESS                 METHOD
> # "local" is for Unix domain socket connections only
> local   all             all                                     peer
> # IPv4 local connections:
> #host    all             all             127.0.0.1/32            ident
> host    all             all             0.0.0.0/0               md5
> # IPv6 local connections:
> host    all             all             ::1/128                 ident
> # Allow replication connections from localhost, by a user with the
> # replication privilege.
> #local   replication     postgres                                peer
> #host    replication     postgres        127.0.0.1/32            ident
> #host    replication     postgres        ::1/128                 ident
>
>
>
> [root@airflowetl airflow]# cat /var/lib/pgsql/data/postgresql.conf
> ....
> #------------------------------------------------------------------------------
> # CONNECTIONS AND AUTHENTICATION
> #------------------------------------------------------------------------------
> # - Connection Settings -
> #listen_addresses = 'localhost' # what IP address(es) to listen on;
> listen_addresses = '*' # for Airflow connection
>
>
>
> [airflow@airflowetl airflow]$ cat airflow.cfg
> ....
> [core]
> ....
> # The executor class that airflow should use. Choices include
> # SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor,
> # KubernetesExecutor
> #executor = SequentialExecutor
> executor = LocalExecutor
>
> # The SqlAlchemy connection string to the metadata database.
> # SqlAlchemy supports many different database engines, more information on
> # their website
> #sql_alchemy_conn = sqlite:////home/airflow/airflow/airflow.db
> # if using localhost instead of 127.0.0.1, postgres will use IPv6
> sql_alchemy_conn = postgresql+psycopg2://airflow:[email protected]:5432/airflow
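>
> As a sanity check that the connection string itself works, here is a
> minimal sketch with SQLAlchemy (same URI as above; it just prints the
> server version, so a failure here would point at the DB layer rather
> than at Airflow):
>
>     from sqlalchemy import create_engine
>
>     # Same URI as sql_alchemy_conn above.
>     engine = create_engine(
>         "postgresql+psycopg2://airflow:[email protected]:5432/airflow")
>     with engine.connect() as conn:
>         print(conn.execute("SELECT version()").scalar())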
>
>
>
> On Wed, Dec 18, 2019 at 10:36 PM Jarek Potiuk <[email protected]>
> wrote:
>
>> I seriously doubt it's the problem. There should be dag/task logs in
>> your logs folder as well, and they should tell you what happened. What
>> database are you using? Can you please dig deeper and provide more logs?
>>
>> J,
>>
>> On Thu, Dec 19, 2019 at 1:28 AM Reed Villanueva <[email protected]>
>> wrote:
>>
>>> Looking again at my lscpu specs, I noticed...
>>>
>>> [airflow@airflowetl airflow]$ lscpu
>>> Architecture:          x86_64
>>> CPU op-mode(s):        32-bit, 64-bit
>>> Byte Order:            Little Endian
>>> CPU(s):                8
>>> On-line CPU(s) list:   0-7
>>> Thread(s) per core:    1
>>> Core(s) per socket:    4
>>> Socket(s):             2
>>>
>>> Notice Thread(s) per core: 1
>>>
>>> Looking at my airflow.cfg settings I see max_threads = 2. Setting
>>> max_threads = 1 and restarting the scheduler
>>> <https://www.astronomer.io/guides/airflow-scaling-workers/> seems to
>>> have fixed the problem.
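>>>
>>> To confirm the value the scheduler actually picks up, a quick sketch
>>> (assuming max_threads lives under [scheduler], as in the default
>>> config):
>>>
>>>     from airflow.configuration import conf
>>>
>>>     # Should print 1 after the change and a scheduler restart.
>>>     print(conf.getint("scheduler", "max_threads"))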
>>>
>>> If anyone knows more about what exactly is going wrong under the hood
>>> (e.g. why the task fails rather than just waiting for another thread to
>>> become available), I would be interested to hear about it.
>>>
>>> On Wed, Dec 18, 2019 at 11:45 AM Reed Villanueva <[email protected]>
>>> wrote:
>>>
>>>> Running an airflow DAG that ran fine with the SequentialExecutor, many
>>>> (though not all) simple tasks now fail without any log information when
>>>> running with the LocalExecutor and minimal parallelism, e.g.
>>>>
>>>> <airflow.cfg>
>>>> # overall task concurrency limit for airflow
>>>> parallelism = 8 # which is same as number of cores shown by lscpu
>>>> # max tasks per dag
>>>> dag_concurrency = 2
>>>> # max instances of a given dag that can run on airflow
>>>> max_active_runs_per_dag = 1
>>>> # max threads used per worker / core
>>>> max_threads = 2
>>>>
>>>> see https://www.astronomer.io/guides/airflow-scaling-workers/
>>>>
>>>> Looking at the airflow-webserver.* logs nothing looks out of the
>>>> ordinary, but looking at airflow-scheduler.out I see...
>>>>
>>>> [airflow@airflowetl airflow]$ tail -n 20 airflow-scheduler.out
>>>> ....
>>>> [2019-12-18 11:29:17,773] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table1 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
>>>> [2019-12-18 11:29:17,779] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table2 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
>>>> [2019-12-18 11:29:17,782] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table3 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
>>>> [2019-12-18 11:29:18,833] {scheduler_job.py:832} WARNING - Set 1 task instances to state=None as their associated DagRun was not in RUNNING state
>>>> [2019-12-18 11:29:18,844] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table4 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status success for try_number 1
>>>> ....
>>>>
>>>> but I'm not really sure what to take away from this.
>>>>
>>>> Anyone know what could be going on here or how to get more helpful
>>>> debugging info?
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>>
>>
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
