Kengo Seki created AIRFLOW-2382:
-----------------------------------

             Summary: Fix wrong description for delimiter
                 Key: AIRFLOW-2382
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2382
             Project: Apache Airflow
          Issue Type: Bug
          Components: aws, operators
            Reporter: Kengo Seki


The documentation for S3ListOperator says:

{code}
:param delimiter: The delimiter by which you want to filter the objects.
    For e.g to lists the CSV files from in a directory in S3 you would use
    delimiter='.csv'.
{code}

{code}
**Example**:
    The following operator would list all the CSV files from the S3
    ``customers/2018/04/`` key in the ``data`` bucket. ::

        s3_file = S3ListOperator(
            task_id='list_3s_files',
            bucket='data',
            prefix='customers/2018/04/',
            delimiter='.csv',
            aws_conn_id='aws_customers_conn'
        )
{code}

but the operator actually does the opposite, excluding the keys that contain the delimiter:

{code}
In [1]: from airflow.contrib.operators.s3_list_operator import S3ListOperator

In [2]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3').execute(None)
[2018-04-26 10:34:27,001] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.amazonaws.com
[2018-04-26 10:34:27,711] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3-ap-northeast-1.amazonaws.com
[2018-04-26 10:34:27,801] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.ap-northeast-1.amazonaws.com
Out[2]: ['0.csv', '1.txt', '2.jpg', '3.exe']

In [3]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3', delimiter='.csv').execute(None)
[2018-04-26 10:34:39,722] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.amazonaws.com
[2018-04-26 10:34:40,483] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3-ap-northeast-1.amazonaws.com
[2018-04-26 10:34:40,569] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.ap-northeast-1.amazonaws.com
Out[3]: ['1.txt', '2.jpg', '3.exe']
{code}

This is because the 'delimiter' parameter represents the path hierarchy
(so '/' is typically used), not a file extension: keys that contain the
delimiter are grouped under a common prefix instead of being returned as
objects. S3ToGoogleCloudStorageOperator has the same problem.
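
To illustrate, here is a minimal pure-Python sketch of how I understand S3
to split a listing when a delimiter is given. It makes no AWS calls, and
`list_keys` is a hypothetical helper written for this issue, not part of
Airflow or boto. It assumes the operator returns only the keys from the
"Contents" part of the listing, which is consistent with the output above.

```python
def list_keys(all_keys, prefix="", delimiter=""):
    """Emulate S3's split of a listing into Contents and CommonPrefixes.

    Hypothetical helper for illustration only; not an Airflow or boto API.
    """
    contents, common_prefixes = [], []
    for key in sorted(all_keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Position of the first delimiter after the prefix, or -1 if none.
        idx = rest.find(delimiter) if delimiter else -1
        if idx == -1:
            # No delimiter in the remainder: the key is returned as an object.
            contents.append(key)
        else:
            # Otherwise the key is rolled up into a common prefix that ends
            # at the first occurrence of the delimiter.
            rolled = prefix + rest[:idx + len(delimiter)]
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
    return contents, common_prefixes


keys = ["0.csv", "1.txt", "2.jpg", "3.exe"]

# With delimiter='.csv', the CSV key is rolled up and drops out of Contents.
contents, prefixes = list_keys(keys, delimiter=".csv")

# Selecting by extension has to be done on the client side instead.
csv_only = [k for k in list_keys(keys)[0] if k.endswith(".csv")]
```

With these inputs, `contents` holds the three non-CSV keys, matching Out[3]
above, while `csv_only` holds just '0.csv'. Filtering on the key suffix after
listing is one possible way to get the behavior the docstring describes.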



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
