[ https://issues.apache.org/jira/browse/ARROW-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
bb updated ARROW-7805:
----------------------
Description:
The Pyarrow lib (using libhdfs) appears to default to a SkipTrash option, which is _not_ the default behavior of the Hadoop fs shell. This turned out to be a major issue for a recent project. The HadoopFileSystem `delete` method currently defaults to `recursive=False`; a similar `skipTrash=False` default would be appropriate here. If that is not possible, it may be best to print a `WARNING` to the console to warn users that this is a point of no return.

{code:bash}
# test using hadoop fs shell commands

# set up the test & confirm that the file exists
$ testfile="/user/myusername/testfile1"
$ hadoop fs -touchz $testfile && hadoop fs -ls $testfile
-rw-r----- 3 myusername mygroup 0 2020-02-08 13:25 /user/myusername/testfile1

# remove the file and confirm that it is moved to the Trash
$ hadoop fs -rm $testfile
20/02/08 13:26:04 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/myusername/testfile1' to trash at: hdfs://nameservice1/user/.Trash/myusername/Current/user/myusername/testfile1

# verify that it is in the Trash
$ hadoop fs -ls /user/.Trash/myusername/Current/user/myusername/testfile1
-rw-r----- 3 myusername mygroup 0 2020-02-08 13:25 /user/.Trash/myusername/Current/user/myusername/testfile1
{code}
{code:python}
# test using pyarrow
import os
import subprocess

from app import conf

LIBHDFS_PATH = conf["libhdfs_path"]
os.environ["ARROW_LIBHDFS_DIR"] = LIBHDFS_PATH

import pyarrow

TEST_FILE = 'testfile2'
TEST_FILE_PATH = f'/user/myusername/{TEST_FILE}'
TRASH_FILE_PATH = f'/user/.Trash/myusername/Current/user/myusername/{TEST_FILE}'

fs = pyarrow.hdfs.connect(driver="libhdfs")


def setup_test():
    """Create the test file."""
    print('create test file...')
    subprocess.run(f'hadoop fs -touchz {TEST_FILE_PATH}'.split())


def run_test():
    """Create the test file, try to remove it, and report what happened."""
    setup_test()
    try:
        # relative paths resolve against the user's HDFS home directory
        print(f'check if test file {TEST_FILE} exists: {bool(fs.ls(TEST_FILE))}')
        print(f'attempt to remove: {TEST_FILE}')
        fs.rm(TEST_FILE)
        print(f'file {TEST_FILE} removed successfully')
    except Exception:
        print('encountered an error in run_test')


def check_file_in_hdfs_trash():
    """Verify whether the removed file landed in the HDFS Trash."""
    try:
        fs.ls(TRASH_FILE_PATH)
    except Exception:
        print(f'test file {TEST_FILE} not found in {TRASH_FILE_PATH}!!')


run_test()
check_file_in_hdfs_trash()

# output...
create test file...
check if test file testfile2 exists: True
attempt to remove: testfile2
file testfile2 removed successfully
test file testfile2 not found in /user/.Trash/myusername/Current/user/myusername/testfile2!!
{code}

was:
The Pyarrow lib (using libhdfs) appears to default to a SkipTrash option, which is _not_ the default behavior of the Hadoop fs shell. This turned out to be a major issue for a project. The HadoopFileSystem `delete` method currently defaults to `recursive=False`; a similar `skipTrash=False` default would be appropriate here.


> Apache Arrow HDFS Remove (rm) operation defaults to SkipTrash
> -------------------------------------------------------------
>
>                 Key: ARROW-7805
>                 URL: https://issues.apache.org/jira/browse/ARROW-7805
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: bb
>            Priority: Major
>
> The Pyarrow lib (using libhdfs) appears to default to a SkipTrash option, which is _not_ the default behavior of the Hadoop fs shell. This turned out to be a major issue for a recent project. The HadoopFileSystem `delete` method currently defaults to `recursive=False`; a similar `skipTrash=False` default would be appropriate here. If that is not possible, it may be best to print a `WARNING` to the console to warn users that this is a point of no return.
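Until a trash-aware default (or at least a warning) exists, the client can emulate `fs.TrashPolicyDefault` itself by renaming into the trash directory instead of calling `rm`. The sketch below is only illustrative: `trash_path` and `rm_to_trash` are hypothetical helpers, not part of pyarrow; the trash layout `/user/.Trash/<user>/Current/<original path>` is copied from the cluster output above (stock Hadoop uses `/user/<user>/.Trash/Current/...`); and `fs` is assumed to behave like the legacy `pyarrow.hdfs.HadoopFileSystem`, exposing `exists`, `mkdir`, and `rename`.

```python
import posixpath
import time


def trash_path(path, user):
    """Return the trash location for `path`, mirroring the layout shown by
    fs.TrashPolicyDefault on this cluster:
    /user/.Trash/<user>/Current/<original absolute path>."""
    return posixpath.join('/user/.Trash', user, 'Current', path.lstrip('/'))


def rm_to_trash(fs, path, user):
    """Move `path` into the user's trash instead of deleting it permanently.

    `fs` is assumed to expose exists/mkdir/rename (as the legacy
    pyarrow.hdfs.HadoopFileSystem does). If a same-named entry already
    sits in the trash, a millisecond timestamp is appended, similar to
    what `hadoop fs -rm` does on name collisions."""
    target = trash_path(path, user)
    parent = posixpath.dirname(target)
    if not fs.exists(parent):
        fs.mkdir(parent)  # create the trash subdirectory on first use
    if fs.exists(target):
        target = f'{target}.{int(time.time() * 1000)}'
    fs.rename(path, target)
    return target
```

With the example above, `rm_to_trash(fs, '/user/myusername/testfile2', 'myusername')` would leave the file visible under the trash path rather than removing it irrecoverably.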
>
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)