url54 opened a new issue, #45718:
URL: https://github.com/apache/airflow/issues/45718

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.10.3
   
   ### What happened?
   
   The DAG processing manager attempts to parse PPTX files, which causes it to write corrupted data to the metadata database.
   
   ### What you think should happen instead?
   
   Ideally, it should not parse any non-Python files from the `/usr/local/airflow/dags` directory.
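
As a rough illustration of the expected behaviour, here is a minimal sketch of a stricter file filter. The helper name `is_candidate_dag_file` is hypothetical and is not part of Airflow. The sketch assumes that `.zip` archives should still be accepted, since Airflow supports packaged DAGs, and that every other ZIP-based container should be skipped.

```python
import tempfile
import zipfile
from pathlib import Path


def is_candidate_dag_file(path: Path) -> bool:
    """Hypothetical filter: accept .py files and genuine ZIP archives named .zip."""
    if not path.is_file():
        return False
    suffix = path.suffix.lower()
    if suffix == ".py":
        return True
    # Require both the .zip extension and a valid ZIP structure, so that
    # ZIP-based formats such as .pptx or .docx are not picked up.
    return suffix == ".zip" and zipfile.is_zipfile(path)


# Small demonstration in a throwaway directory standing in for the DAGs folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "my_dag.py").write_text("from airflow import DAG")
    for name in ("packaged_dags.zip", "slides.pptx"):
        with zipfile.ZipFile(folder / name, "w") as archive:
            archive.writestr("my_dag.py", "from airflow import DAG")
    accepted = sorted(p.name for p in folder.iterdir() if is_candidate_dag_file(p))

print(accepted)
```

With this filter, `slides.pptx` is skipped even though it is a valid ZIP container, while the `.py` file and the `.zip` archive are kept.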
   
   ### How to reproduce
   
   Copy the text of an Airflow DAG file and paste it into a PPTX slide. Upload the PPTX to `/usr/local/airflow/dags` and wait for errors to appear in the DAG processing manager logs (this may take a little while). For reference, the file I used and a screenshot are attached:
   
   
![Image](https://github.com/user-attachments/assets/4e7413a1-fb67-4ad9-8cc2-38e230f5fb35)
   
   [Testing AIrflow Bug.pptx](https://github.com/user-attachments/files/18444241/Testing.AIrflow.Bug.pptx)
   
   ### Operating System
   
   Amazon Linux 2023
   
   ### Versions of Apache Airflow Providers
   
   | Connection type | Package |
   |--|--|
   | AWS Connection | [apache-airflow-providers-amazon[aiobotocore]==9.0.0](https://airflow.apache.org/docs/apache-airflow-providers-amazon/9.0.0/index.html) |
   | Postgres Connection | [apache-airflow-providers-postgres==5.13.1](https://airflow.apache.org/docs/apache-airflow-providers-postgres/5.13.1/index.html) |
   | FTP Connection | [apache-airflow-providers-ftp==3.11.1](https://airflow.apache.org/docs/apache-airflow-providers-ftp/3.11.1/index.html) |
   | Fab Connection | [apache-airflow-providers-fab==1.5.0](https://airflow.apache.org/docs/apache-airflow-providers-fab/1.5.0/index.html) |
   | Celery Connection | [apache-airflow-providers-celery==3.8.3](https://airflow.apache.org/docs/apache-airflow-providers-celery/3.8.3/index.html) |
   | HTTP Connection | [apache-airflow-providers-http==4.13.2](https://airflow.apache.org/docs/apache-airflow-providers-http/4.13.2/index.html) |
   | IMAP Connection | [apache-airflow-providers-imap==3.7.0](https://airflow.apache.org/docs/apache-airflow-providers-imap/3.7.0/index.html) |
   | Common SQL | [apache-airflow-providers-common-sql==1.19.0](https://airflow.apache.org/docs/apache-airflow-providers-common-sql/1.19.0/index.html) |
   | SQLite Connection | [apache-airflow-providers-sqlite==3.9.0](https://airflow.apache.org/docs/apache-airflow-providers-sqlite/3.9.0/index.html) |
   | SMTP Connection | [apache-airflow-providers-smtp==1.8.0](https://airflow.apache.org/docs/apache-airflow-providers-smtp/1.8.0/index.html) |
   
   ### Deployment
   
   Amazon (AWS) MWAA
   
   ### Deployment details
   
   Nothing special: a default MWAA deployment with no additional configuration or requirements.
   
   ### Anything else?
   
   Based on how `zipfile` detection works here: https://github.com/apache/airflow/blob/main/airflow/utils/file.py#L276
   
   It appears that the set of file types that could be parsed extends well beyond PPTX [[1]](https://en.wikipedia.org/wiki/List_of_file_signatures):
   
   | Hex signature | ISO 8859-1 | Offset | Extension | Description |
   |--|--|--|--|--|
   | 50 4B 03 04, 50 4B 05 06 (empty archive), 50 4B 07 08 (spanned archive) | PK␃␄, PK␅␆, PK␇␈ | 0 | zip, aar, apk, docx, epub, [ipa](https://en.wikipedia.org/wiki/.ipa), jar, kmz, [maff](https://en.wikipedia.org/wiki/Mozilla_Archive_Format), msix, odp, ods, odt, pk3, pk4, pptx, usdz, vsdx, xlsx, [xpi](https://en.wikipedia.org/wiki/XPInstall) | [zip file format](https://en.wikipedia.org/wiki/ZIP_(file_format)) and formats based on it, such as [EPUB](https://en.wikipedia.org/wiki/EPUB), [JAR](https://en.wikipedia.org/wiki/JAR_(file_format)), [ODF](https://en.wikipedia.org/wiki/OpenDocument), [OOXML](https://en.wikipedia.org/wiki/Office_Open_XML) |
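
To make the point above concrete, the following sketch shows that Python's `zipfile.is_zipfile` identifies a file by its ZIP structure, not by its extension. The file names and the helper names `make_zip_container` and `looks_like_zip` are made up for illustration, and the snippet only exercises the standard library; it does not reproduce Airflow's actual file discovery code.

```python
import tempfile
import zipfile
from pathlib import Path


def looks_like_zip(path: Path) -> bool:
    """Return True if the standard library recognises the file as a ZIP archive."""
    return zipfile.is_zipfile(path)


def make_zip_container(path: Path) -> None:
    """Write a tiny ZIP archive to `path`, whatever its extension is."""
    with zipfile.ZipFile(path, "w") as archive:
        # The member text mimics DAG source pasted into a slide.
        archive.writestr("ppt/slides/slide1.xml", "from airflow import DAG")


results = {}
with tempfile.TemporaryDirectory() as tmp:
    # ZIP-based containers with different extensions.
    for name in ("slides.pptx", "report.docx", "library.jar", "bundle.zip"):
        target = Path(tmp) / name
        make_zip_container(target)
        results[name] = looks_like_zip(target)

    # A plain text file for comparison.
    plain = Path(tmp) / "notes.txt"
    plain.write_text("from airflow import DAG")
    results[plain.name] = looks_like_zip(plain)

for name, detected in results.items():
    print(f"{name}: {detected}")
```

Every ZIP-based file is reported as an archive regardless of its extension, while the plain text file is not, so any check that relies on `zipfile.is_zipfile` alone would treat all of the formats in the table as candidates.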
   
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

