Seem to have found the problem. Had a piece of code like...
hadoop fs -Dfs.mapr.trace=debug -get \
    ftp://$FTP_CLIENT:$FTP_PASS@$FTP_IP/$FTP_DIR"$TABLENAME.TSV" \
    $PROJECT_HOME/tmp/"$TABLENAME.TSV" \
    | hadoop fs -moveFromLocal $PROJECT_HOME/tmp/"$TABLENAME.TSV" "$DATASTORE"
changed to
hadoop fs -Dfs.mapr.trace=debug -get \
    ftp://$FTP_CLIENT:$FTP_PASS@$FTP_IP/$FTP_DIR"$TABLENAME.TSV" \
    $PROJECT_HOME/tmp/"$TABLENAME.TSV"
hadoop fs -moveFromLocal $PROJECT_HOME/tmp/"$TABLENAME.TSV" "$DATASTORE"
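As an extra safeguard, the two steps could also be chained with && so the
move only runs if the download actually succeeded (a minimal sketch using the
same variables as above, not what I have deployed):

# only attempt the HDFS move if the FTP get exited successfully
hadoop fs -Dfs.mapr.trace=debug -get \
    ftp://$FTP_CLIENT:$FTP_PASS@$FTP_IP/$FTP_DIR"$TABLENAME.TSV" \
    $PROJECT_HOME/tmp/"$TABLENAME.TSV" \
  && hadoop fs -moveFromLocal $PROJECT_HOME/tmp/"$TABLENAME.TSV" "$DATASTORE"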
IDK why exactly, but there was clearly a problem with using the pipe in this
way. I *suspect* it is a race condition: the two sides of a pipe are started
concurrently, so -moveFromLocal can try to read the file from the local temp
dir before -get has finished writing it (I would get similar 'file not found'
errors when running the commands manually in the shell chained together with
a pipe). Note that neither command reads from stdin or writes the file to
stdout here, so the pipe buys nothing except that unwanted concurrency.
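A minimal, self-contained sketch of the suspected race (nothing
Hadoop-specific, just shell semantics):

rm -f /tmp/demo.tsv
# both sides of a pipe are started at the same time, so the right-hand
# command can look for the file before the left-hand one has created it
( sleep 2; touch /tmp/demo.tsv ) | ls -l /tmp/demo.tsv
# ls: cannot access /tmp/demo.tsv: No such file or directory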
Also can't say why any of this would cause the airflow scheduler problems,
but I have run the task several times since the change and have not seen the
scheduler error again, so I'm not sure what to make of that.
If anyone can explain any of this weirdness, please do let me know so I can
make this answer a bit more complete. I will continue to debug and update.
On Mon, Dec 9, 2019 at 12:17 PM Reed Villanueva <[email protected]>
wrote:
> I see, thanks.
>
> Though I have already tried manually restarting the scheduler (i.e.
> deleting all airflow-scheduler.* files, killing the scheduler -D process,
> then running it again) and am still seeing the same error, so I'm not sure
> how setting an automated restart would help.
>
> On Mon, Dec 9, 2019 at 12:09 PM Aaron Grubb <[email protected]>
> wrote:
>
>> I should have been more specific: I meant set it to something low, like an
>> hour, to test whether restarting the scheduler fixes the problem; if it
>> does, increase it to the recommended 24 hours.
>>
>>
>>
>> *From:* Reed Villanueva <[email protected]>
>> *Sent:* Monday, December 9, 2019 4:59 PM
>> *To:* [email protected]
>> *Subject:* Re: Airflow scheduler complains no heartbeat when running
>> daemon
>>
>>
>>
>> Aaron,
>>
>> I'm pretty new to airflow as well and curious why you think setting
>> scheduler.run_duration to something very low would be helpful here. To me
>> it seems odd to have the scheduler restarting every, say, 30 seconds (I'm
>> also not sure how this would affect the airflow jobs that need to run
>> throughout the day). From this article (
>> https://www.astronomer.io/blog/7-common-errors-to-check-when-debugging-airflow-dag/),
>> restarting once every 24 hours seems to be the recommendation.
>>
>>
>>
>> On Mon, Dec 9, 2019 at 11:14 AM Aaron Grubb <[email protected]>
>> wrote:
>>
>> Don’t take this at face value since I’m a novice with Airflow, but my
>> understanding of best practices is to have the scheduler restart every so
>> often (command line: -r <seconds>, or config: scheduler.run_duration =
>> <seconds>). Kill all the processes and try setting that to something low;
>> then, if the problem goes away, increase it to a day or so.
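>>
>> Concretely, in airflow.cfg that would be something like the following (a
>> sketch for 1.10.x, where run_duration is still available; 86400 is just
>> the once-a-day example):
>>
>> [scheduler]
>> # make the scheduler process exit after this many seconds; something
>> # external (e.g. systemd) then has to start it again. -1 = run forever
>> run_duration = 86400
>>
>> or equivalently on the command line: airflow scheduler -r 86400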
>>
>>
>>
>> *From:* Reed Villanueva <[email protected]>
>> *Sent:* Monday, December 9, 2019 3:48 PM
>> *To:* [email protected]
>> *Subject:* Airflow scheduler complains no heartbeat when running daemon
>>
>>
>>
>> I have a problem where the airflow (v1.10.5) webserver complains...
>>
>> The scheduler does not appear to be running. Last heartbeat was received
>> 45 minutes ago.
>>
>> But checking the scheduler daemon process (started via airflow scheduler
>> -D), I can see...
>>
>> [airflow@airflowetl airflow]$ cat airflow-scheduler.pid
>>
>> 64186
>>
>> [airflow@airflowetl airflow]$ ps -aux | grep 64186
>>
>> airflow 64186 0.0 0.1 663340 67796 ? S 15:03 0:00
>> /usr/bin/python3 /home/airflow/.local/bin/airflow scheduler -D
>>
>> airflow 94305 0.0 0.0 112716 964 pts/4 R+ 16:01 0:00 grep
>> --color=auto 64186
>>
>> and after some period of time the error message *goes away again*.
>>
>> This happens very frequently off-and-on even after restarting both the
>> webserver and scheduler.
>>
>> The airflow-scheduler.err file is empty, and the .out and .log files
>> appear innocuous (I need more time to look through them in depth).
>>
>> Running the scheduler in the terminal to see the feed live, everything
>> seems to run fine until I see this output in the middle of the dag
>> execution:
>>
>> [2019-11-29 15:51:57,825] {__init__.py:51} INFO - Using executor
>> SequentialExecutor
>>
>> [2019-11-29 15:51:58,259] {dagbag.py:90} INFO - Filling up the DagBag from
>> /home/airflow/airflow/dags/my_dag_file.py
>>
>> Once this pops up, I can see in the web UI that the scheduler heartbeat
>> error message appears. (Oddly, killing the scheduler process here does not
>> generate the heartbeat error message in the web UI.) Checking for the
>> scheduler process, I see...
>>
>> [airflow@airflowetl airflow]$ ps -aux | grep scheduler
>>
>> airflow 3409 0.2 0.1 523336 67384 ? S Oct24 115:06 airflow
>> scheduler -- DagFileProcessorManager
>>
>> airflow 25569 0.0 0.0 112716 968 pts/4 S+ 16:00 0:00 grep
>> --color=auto scheduler
>>
>> airflow 56771 0.0 0.1 662560 67264 ? S Nov26 4:09 airflow
>> scheduler -- DagFileProcessorManager
>>
>> airflow 64187 0.0 0.1 662564 67096 ? S Nov27 0:00 airflow
>> scheduler -- DagFileProcessorManager
>>
>> airflow 153959 0.1 0.1 662568 67232 ? S 15:01 0:06 airflow
>> scheduler -- DagFileProcessorManager
>>
>> IDK if this is normal or not.
>>
>> I thought the problem may have been older scheduler processes that had
>> not been killed off and were still running...
>>
>> [airflow@airflowetl airflow]$ kill -9 3409 36771
>>
>> bash: kill: (36771) - No such process
>>
>> [airflow@airflowetl airflow]$ ps -aux | grep scheduler
>>
>> airflow 56771 0.0 0.1 662560 67264 ? S Nov26 4:09 airflow
>> scheduler -- DagFileProcessorManager
>>
>> airflow 64187 0.0 0.1 662564 67096 ? S Nov27 0:00 airflow
>> scheduler -- DagFileProcessorManager
>>
>> airflow 153959 0.0 0.1 662568 67232 ? S Nov29 0:06 airflow
>> scheduler -- DagFileProcessorManager
>>
>> airflow 155741 0.0 0.0 112712 968 pts/2 R+ 15:54 0:00 grep
>> --color=auto scheduler
>>
>> Notice all the various start times in the output.
>>
>> Doing a kill -9 56771 64187 ... and then rerunning airflow scheduler -D does
>> not seem to have fixed the problem.
>>
>> Note: the scheduler seems to consistently stop running after a task fails
>> to move a file from an FTP location to an HDFS one...
>>
>> hadoop fs -Dfs.mapr.trace=debug -get \
>>     ftp://$FTP_CLIENT:$FTP_PASS@$FTP_IP/$FTP_DIR"$TABLENAME.TSV" \
>>     $PROJECT_HOME/tmp/"$TABLENAME.TSV" \
>>     | hadoop fs -moveFromLocal $PROJECT_HOME/tmp/"$TABLENAME.TSV" "$DATASTORE"
>>
>> # see https://stackoverflow.com/a/46433847/8236733
>>
>> Note there *is* a logic error in this line, since $DATASTORE is an HDFS
>> dir path, not a file path, but either way I don't think the airflow
>> scheduler should be missing heartbeats from something seemingly so
>> unrelated.
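>>
>> For what it's worth, a sketch of what the corrected line would look like,
>> moving to an explicit file path under the datastore dir:
>>
>> hadoop fs -moveFromLocal $PROJECT_HOME/tmp/"$TABLENAME.TSV" \
>>     "$DATASTORE/$TABLENAME.TSV"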
>>
>> Anyone know what could be going on here or how to fix it?