[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-28 Thread Jozef Vilcek (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594868#comment-16594868
 ] 

Jozef Vilcek commented on BEAM-4861:


Yes, that makes sense to me.

> Hadoop Filesystem silently fails
> 
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Reporter: Jozef Vilcek
>Assignee: Chamikara Jayalath
>Priority: Major
>
> Hi,
> Beam FileSystem operations copy, rename and delete are void in the SDK. Hadoop 
> native filesystem operations are not; they return a boolean. The current 
> implementation in Beam ignores the result and passes as long as no exception 
> is thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If 
> the target directory does not exist, the operation returns false and does not 
> touch the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594794#comment-16594794
 ] 

Tim Robertson commented on BEAM-4861:
-

On further inspection I think {{delete}} and {{copy}} are correct to swallow a 
{{false}} response [~JozoVilcek]:
 * A {{delete}}, for example, will return {{false}} when you try to delete a 
non-existent file, which seems reasonable to swallow. It will throw an 
exception for the scenarios that matter.
 * The {{copy}} returns {{false}} only if there is an issue with {{mkdirs}}, 
and the HDFS docs [1] state that it always returns true even if the directory 
is not created. I think we can ignore the local filesystem implementation.

For {{rename()}} we can create the directory if it does not exist, and then we 
should throw an exception on any response that is {{false}}.
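
A minimal sketch of that {{rename()}} behaviour follows. It is an illustration under assumptions, not the actual patch: {{MiniFs}}, {{SafeRename}} and {{parentOf}} are hypothetical names, and {{MiniFs}} only mimics the boolean-returning {{exists}}, {{mkdirs}} and {{rename}} calls of a Hadoop-style filesystem, using plain strings for paths.

```java
import java.io.IOException;

/** Hypothetical sketch; MiniFs only mimics the boolean-returning Hadoop-style calls. */
final class SafeRename {

  /** Minimal stand-in for the parts of a filesystem used here (an assumption). */
  interface MiniFs {
    boolean exists(String path) throws IOException;

    boolean mkdirs(String path) throws IOException;

    boolean rename(String src, String dst) throws IOException;
  }

  private SafeRename() {}

  /** Returns the parent of a '/'-separated path, or null if it has none. */
  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash <= 0 ? null : path.substring(0, slash);
  }

  /** Creates the destination directory if needed, then renames; throws on any false result. */
  static void rename(MiniFs fs, String src, String dst) throws IOException {
    String parent = parentOf(dst);
    if (parent != null && !fs.exists(parent) && !fs.mkdirs(parent)) {
      throw new IOException("Unable to create directory " + parent);
    }
    if (!fs.rename(src, dst)) {
      throw new IOException("Unable to rename " + src + " to " + dst);
    }
  }
}
```

In the real {{HadoopFileSystem}} the same two checks would sit inside the existing loop over source and destination resource ids, so the first failing pair aborts the call.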

 

[1] 
[https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 

 



[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Jozef Vilcek (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593447#comment-16593447
 ] 

Jozef Vilcek commented on BEAM-4861:


For an unsuccessful operation, I would throw an exception as well. In 
practice, this is what helpers wrapping the native HDFS boolean methods mostly 
do: fail, and investigate later what went wrong.

For rename, creating directories where necessary sounds good. With overwrites 
allowed, the behaviour would also be consistent with what I observe on "normal 
file create" operations. Perhaps overwriting is allowed for cases of restarting 
jobs from snapshots, which can lead to reprocessing and recreating the same 
outputs? Not sure.



[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593406#comment-16593406
 ] 

Tim Robertson commented on BEAM-4861:
-

The {{HadoopFileSystem}} has the following methods:
{code:java}
  @Override
  protected void copy(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
      // to use the inefficient implementation found in FileUtil which copies all the bytes through
      // the local machine.
      //
      // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
      // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
      // is not what we want. Also, all the other FileSystem implementations I saw threw
      // UnsupportedOperationException within concat.
      FileUtil.copy(
          fileSystem,
          srcResourceIds.get(i).toPath(),
          fileSystem,
          destResourceIds.get(i).toPath(),
          false,
          true,
          fileSystem.getConf());
    }
  }

  @Override
  protected void rename(
      List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
      throws IOException {
    for (int i = 0; i < srcResourceIds.size(); ++i) {
      fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
    }
  }

  @Override
  protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
    for (HadoopResourceId resourceId : resourceIds) {
      fileSystem.delete(resourceId.toPath(), false);
    }
  }
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all 
return false indicating that the operation was not performed.

 

*1. Informing the user of the unsuccessful operation*

We could either:
 # Change the Beam {{FileSystem}} API to propagate this, although the [rules 
for HDFS 
rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d]
 are not trivial to document and this might prove to be invasive in many places.
 # Throw an {{IOException}} to signal that the operation was not successful if 
the response is false?

I lean towards leaving the API returning void but throwing an exception on the 
first error within the loops - thoughts?
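
The second option could look roughly like the sketch below: a small wrapper that turns a {{false}} result into an {{IOException}}. {{FailFast}}, {{BooleanOp}} and {{check}} are hypothetical names used only for illustration, not Beam or Hadoop API; the lambda stands in for a boolean-returning call such as {{fileSystem.rename(src, dst)}}.

```java
import java.io.IOException;

/** Hypothetical helper for illustration; not part of Beam or Hadoop. */
final class FailFast {

  /** Stand-in for a boolean-returning filesystem call that may also throw. */
  @FunctionalInterface
  interface BooleanOp {
    boolean run() throws IOException;
  }

  private FailFast() {}

  /** Runs the operation and throws if it reports failure by returning false. */
  static void check(String description, BooleanOp op) throws IOException {
    if (!op.run()) {
      throw new IOException("Operation returned false: " + description);
    }
  }
}
```

Inside the existing loops this would be called once per element, for example {{FailFast.check("rename " + src + " to " + dst, () -> fileSystem.rename(src, dst))}}. Because the exception propagates immediately, the remaining elements are not attempted, which matches "the first case of error" above.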

 

*2. rename() in HDFS*

What do we believe are the expectations for {{rename()}} on HDFS?

Currently the user is not informed when an attempt is made to rename a file 
into a non-existent directory. This is obviously bad.

We could change behaviour to one of:
 # Throw exception if the directory does not exist
 # Create the directory where necessary, letting files be overwritten if it 
does exist (equivalent of e.g. {{S3FileSystem}})
 # Verify that the directory does not exist, and only then create it and 
proceed, otherwise alerting with Exception (the usual behaviour of a 
{{MapReduce FileOutputFormat}} at job startup where it quickly fails with 
"directory already exists").

Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}} 
and {{delete()}} operation internally.

I tend towards creating the directory where necessary allowing for overwriting 
- thoughts?

CC [~reuvenlax] as relates to BEAM-5036 as well.



[jira] [Commented] (BEAM-4861) Hadoop Filesystem silently fails

2018-08-23 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590032#comment-16590032
 ] 

Tim Robertson commented on BEAM-4861:
-

[~chamikara]  - may I take this issue please as I am looking at BEAM-5036 which 
is related?
