[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767871#comment-15767871
 ] 

Eugene Kirpichov commented on BEAM-1190:
----------------------------------------

My proposal is to add a mandatory stat at glob-expansion time (omitting any 
file that doesn't exist from the expansion), while still throwing an error if 
a file doesn't exist at read time. I think this is safe and should not require 
opt-in, since it introduces no new failure modes: both before and after the 
change we fail if a file doesn't exist at read time, but without the change we 
may also fail erroneously when the glob listing includes a file that doesn't 
actually exist at glob-expansion time.
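
To make the proposal concrete, here is a minimal sketch of the two steps, 
using plain java.nio.file rather than Beam's filesystem layer. The class and 
method names (GlobStatFilter, dropMissing, readOrFail) are hypothetical and 
are not existing FileBasedSource APIs; the sketch only illustrates the 
"stat after glob, still fail at read" behavior described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the Beam SDK. */
final class GlobStatFilter {

  /** Glob-expansion time: keep only the listed files that a stat confirms exist. */
  public static List<Path> dropMissing(List<Path> globMatches) {
    List<Path> existing = new ArrayList<>();
    for (Path p : globMatches) {
      // The mandatory stat: a stale listing entry is silently dropped here.
      if (Files.exists(p)) {
        existing.add(p);
      }
    }
    return existing;
  }

  /** Read time: behavior is unchanged, so a missing file is still an error. */
  public static byte[] readOrFail(Path p) throws IOException {
    // Throws an IOException (typically NoSuchFileException) if p no longer exists.
    return Files.readAllBytes(p);
  }
}
```

A file that passes the stat but is deleted before the read still fails in 
readOrFail, which is the same failure we have today.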

Once the filesystem APIs can report whether a file system is strongly 
consistent, we can skip the stat in that case as an optimization.

> FileBasedSource should ignore files that matched the glob but don't exist
> -------------------------------------------------------------------------
>
>                 Key: BEAM-1190
>                 URL: https://issues.apache.org/jira/browse/BEAM-1190
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



