[ https://issues.apache.org/jira/browse/BEAM-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Kirpichov closed BEAM-1841.
----------------------------------
       Resolution: Won't Fix
    Fix Version/s: Not applicable

Probably not worth investing more in FileBasedSource, since Splittable DoFn
(SDF) is coming.

> FileBasedSource should have safeguards for when set of files grows while job
> is running
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-1841
>                 URL: https://issues.apache.org/jira/browse/BEAM-1841
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow, sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>            Priority: Major
>             Fix For: Not applicable
>
>
> In some cases people run pipelines over directories where the set of files in
> the directory grows while the job runs. This may lead to a situation like
> this, in particular with the Dataflow runner:
> At job submission time, the FileBasedSource estimates the current size of the
> filepattern and ends up with a small number. The Dataflow runner therefore
> chooses a small desiredBundleSizeBytes to pass to splitIntoBundles(). However,
> by the time splitIntoBundles() runs, the set of files has grown greatly, and
> we produce many more, unnecessarily small, bundles than anticipated.
> I see a few things we could do:
> - In splitIntoBundles(), compute the actual size and detect when the desired
> size is unreasonably small for it; e.g. set an upper threshold on how many
> bundles we produce in total.
> - Somehow remember, at submission time, what the estimated size was. Then, in
> splitIntoBundles(), compute the actual current size and scale
> desiredBundleSizeBytes accordingly to get approximately the intended number
> of bundles. Caveat: files may still change between the moment the size is
> estimated and the moment splitting happens.
> - (much larger in scope) Change the whole protocol to use the number of
> bundles instead of the bundle size in bytes. This probably won't happen with
> BoundedSource, but it is going to be the case with Splittable DoFn.
> Option 1 seems by far the simplest.
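
For illustration only, the first two options from the issue can be sketched as plain arithmetic on the bundle size. This is not Beam code and the issue was closed without any such change: the class BundleSizeGuard, its method names, and the maxBundles parameter are all hypothetical, and the sketch assumes the actual total size has already been computed at split time.

```java
/** Hypothetical helper, not part of Beam: arithmetic behind options 1 and 2. */
public final class BundleSizeGuard {
  private BundleSizeGuard() {}

  /**
   * Option 1: raise the desired bundle size, if needed, so that splitting
   * actualTotalBytes produces at most maxBundles bundles.
   * maxBundles is an assumed threshold; the issue does not name a value.
   */
  public static long capBundleCount(
      long desiredBundleSizeBytes, long actualTotalBytes, long maxBundles) {
    if (desiredBundleSizeBytes <= 0 || maxBundles <= 0 || actualTotalBytes < 0) {
      throw new IllegalArgumentException("sizes and maxBundles must be positive");
    }
    // Smallest bundle size s for which ceil(actualTotalBytes / s) <= maxBundles.
    long minAllowed = ceilDiv(actualTotalBytes, maxBundles);
    return Math.max(desiredBundleSizeBytes, minAllowed);
  }

  /**
   * Option 2: scale the desired bundle size by the ratio of the actual size to
   * the size estimated at submission time, so the bundle count stays close to
   * what the runner intended.
   */
  public static long scaleToEstimate(
      long desiredBundleSizeBytes, long estimatedTotalBytes, long actualTotalBytes) {
    if (estimatedTotalBytes <= 0) {
      // No usable estimate to scale against; leave the runner's choice alone.
      return desiredBundleSizeBytes;
    }
    double growth = (double) actualTotalBytes / estimatedTotalBytes;
    return Math.max(1L, Math.round(desiredBundleSizeBytes * growth));
  }

  /** Ceiling division for non-negative a and positive b, without overflow. */
  static long ceilDiv(long a, long b) {
    return a / b + (a % b == 0 ? 0 : 1);
  }
}
```

With a desired size of 10 bytes and 1000 bytes actually present, a cap of 50 bundles raises the bundle size to 20. Scaling against an original estimate of 100 bytes raises it to 100, which keeps the originally intended 10 bundles. As the issue notes, neither approach helps if the files change again between size computation and splitting.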