[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails
[ https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594868#comment-16594868 ]

Jozef Vilcek commented on BEAM-4861:

Yes, makes sense to me.

> Hadoop Filesystem silently fails
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
> Issue Type: Bug
> Components: io-java-hadoop
> Reporter: Jozef Vilcek
> Assignee: Chamikara Jayalath
> Priority: Major
>
> Hi,
> Beam FileSystem operations copy, rename and delete return void in the SDK. The
> native Hadoop filesystem operations do not; they return a boolean. The current
> implementation in Beam ignores the result and passes as long as no exception is thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If the
> target directory does not exist, the operation returns false and does not touch
> the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails
[ https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594794#comment-16594794 ]

Tim Robertson commented on BEAM-4861:

On further inspection, I think {{delete}} and {{copy}} are correct to swallow a {{false}} response [~JozoVilcek]:

* A {{delete}}, for example, will return {{false}} when you try to delete a nonexistent file, which seems reasonable to swallow. It will throw an exception for the scenarios that matter.
* The {{copy}} returns {{false}} only if there is an issue with {{mkdirs}}, and the HDFS docs [1] state that {{mkdirs}} always returns true, even if the directory is not created. I think we can ignore the local filesystem implementation.

For {{rename()}} we can create the directory if it does not exist, and should then throw an exception on any response that is {{false}}.

[1] [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
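The {{rename()}} proposal above can be illustrated with a minimal sketch. It is not the Beam or Hadoop implementation: {{InMemoryFs}} and {{RenameSketch}} are hypothetical names, and the in-memory class only imitates the behaviour reported in this issue (rename returns false and leaves the file alone when the target directory is missing). Overwrite handling is deliberately left out.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for a filesystem with boolean results (not Hadoop's API). */
final class InMemoryFs {
  private final Set<String> dirs = new HashSet<>();
  private final Set<String> files = new HashSet<>();

  InMemoryFs() {
    dirs.add("/");
  }

  /** Returns the parent of an absolute path, or "/" for a top-level entry. */
  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash <= 0 ? "/" : path.substring(0, slash);
  }

  /** Creates the directory and all of its ancestors; always reports true. */
  boolean mkdirs(String dir) {
    for (String d = dir; !d.equals("/"); d = parentOf(d)) {
      dirs.add(d);
    }
    return true;
  }

  /** Test helper: creates a file, together with its parent directories. */
  void createFile(String path) {
    mkdirs(parentOf(path));
    files.add(path);
  }

  boolean isFile(String path) {
    return files.contains(path);
  }

  /** Imitates the reported behaviour: false, and no change, if the target directory is missing. */
  boolean rename(String src, String dst) {
    if (!files.contains(src) || !dirs.contains(parentOf(dst))) {
      return false;
    }
    files.remove(src);
    files.add(dst);
    return true;
  }
}

final class RenameSketch {
  private RenameSketch() {}

  /** Creates the destination's parent directory first, then turns a false result into an exception. */
  static void renameCreatingParent(InMemoryFs fs, String src, String dst) throws IOException {
    fs.mkdirs(InMemoryFs.parentOf(dst));
    if (!fs.rename(src, dst)) {
      throw new IOException("Unable to rename " + src + " to " + dst);
    }
  }
}
```

With this shape, a move into a directory that does not exist yet succeeds, while a rename that still reports {{false}} (for example, a missing source) surfaces as an {{IOException}} instead of passing silently.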
[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails
[ https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593447#comment-16593447 ]

Jozef Vilcek commented on BEAM-4861:

For an unsuccessful operation, I would throw an exception as well. In practice, this is what helpers wrapping the native HDFS boolean methods mostly do: fail, and investigate later what went wrong.

For rename, creating directories where necessary sounds good. Plus, with overwrites allowed, the behaviour would be consistent with what I observe for "normal file create" operations. Maybe overwriting is allowed for cases of restarting jobs from snapshots, which can lead to reprocessing and recreating the same outputs again? Not sure.
[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails
[ https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406 ]

Tim Robertson commented on BEAM-4861:

The {{HadoopFileSystem}} has the following methods:

{code:java}
@Override
protected void copy(List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
    throws IOException {
  for (int i = 0; i < srcResourceIds.size(); ++i) {
    // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
    // to use the inefficient implementation found in FileUtil which copies all the bytes through
    // the local machine.
    //
    // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
    // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
    // is not what we want. Also, all the other FileSystem implementations I saw threw
    // UnsupportedOperationException within concat.
    FileUtil.copy(
        fileSystem,
        srcResourceIds.get(i).toPath(),
        fileSystem,
        destResourceIds.get(i).toPath(),
        false,
        true,
        fileSystem.getConf());
  }
}

@Override
protected void rename(
    List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
    throws IOException {
  for (int i = 0; i < srcResourceIds.size(); ++i) {
    fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
  }
}

@Override
protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
  for (HadoopResourceId resourceId : resourceIds) {
    fileSystem.delete(resourceId.toPath(), false);
  }
}
{code}

{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all return false, indicating that the operation was not performed.

*1. Informing the user of the unsuccessful operation*

We could either:

# Change the Beam {{FileSystem}} API to propagate this, although the [rules for HDFS rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d] are not trivial to document and this might prove to be invasive in many places.
# Throw an {{IOException}} to signal that the operation was not successful if the response is false.

I lean towards leaving the API returning void but throwing an exception on the first error within the loops. Thoughts?

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS? Currently the user is not informed when an attempt is made to rename a file into a nonexistent directory. This is obviously bad.

We could change the behaviour to one of:

# Throw an exception if the directory does not exist.
# Create the directory where necessary, letting files be overwritten if it does exist (equivalent to, e.g., {{S3FileSystem}}).
# Verify that the directory does not exist, and only then create it and proceed, otherwise alerting with an exception (the usual behaviour of a {{MapReduce FileOutputFormat}} at job startup, where it quickly fails with "directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} and {{delete()}} operation internally.

I lean towards creating the directory where necessary and allowing overwriting. Thoughts?

CC [~reuvenlax], as this relates to BEAM-5036 as well.
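Option 2 of the first point (keep the void signature, throw on the first false result inside the loop) might look roughly like the sketch below. {{BooleanRename}} and {{FailFastRename}} are hypothetical names introduced only for illustration; the interface stands in for any filesystem whose rename reports failure by returning false, and the code is not taken from Beam or Hadoop.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for a filesystem whose rename reports failure by returning false. */
interface BooleanRename {
  boolean rename(String src, String dst) throws IOException;
}

final class FailFastRename {
  private FailFastRename() {}

  /** Renames pairwise; the signature stays void, but the first false result becomes an IOException. */
  static void renameAll(BooleanRename fs, List<String> srcs, List<String> dsts)
      throws IOException {
    if (srcs.size() != dsts.size()) {
      throw new IllegalArgumentException("source and destination lists differ in length");
    }
    for (int i = 0; i < srcs.size(); ++i) {
      if (!fs.rename(srcs.get(i), dsts.get(i))) {
        // Stop at the first failure so later pairs are not attempted.
        throw new IOException("Unable to rename " + srcs.get(i) + " to " + dsts.get(i));
      }
    }
  }
}
```

Callers keep the same void contract as today; the only visible change is that a silently ignored false turns into an exception naming the pair that failed.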
[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails
[ https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590032#comment-16590032 ]

Tim Robertson commented on BEAM-4861:

[~chamikara] - may I take this issue, please? I am looking at BEAM-5036, which is related.