[jira] [Work logged] (BEAM-4620) UnboundedReadFromBoundedSource.split() should always call split()

ASF GitHub Bot (JIRA) Thu, 24 Jan 2019 09:35:57 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-4620?focusedWorklogId=189576&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-189576
 ]


ASF GitHub Bot logged work on BEAM-4620:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Jan/19 17:29
            Start Date: 24/Jan/19 17:29
    Worklog Time Spent: 10m 
      Work Description: swegner commented on pull request #7555: [BEAM-4620] 
UnboundedReadFromBoundedSource invokes split for small bounded sources
URL: https://github.com/apache/beam/pull/7555#discussion_r249216191
 
 

 ##########
 File path: 
runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/UnboundedReadFromBoundedSource.java
 ##########
 @@ -127,9 +131,11 @@ public void validate() {
         long desiredBundleSize = boundedSource.getEstimatedSizeBytes(options) 
/ desiredNumSplits;
         if (desiredBundleSize <= 0) {
           LOG.warn(
-              "BoundedSource {} cannot estimate its size, skips the initial 
splits.",
 
 Review comment:
   I'm trying to reason about whether there's a scenario where this might 
happen with `boundedSource.getEstimatedSizeBytes(options) > 0`. For example, if 
a runner does over-splitting: dividing a small/normal estimated size by a large 
`desiredNumSplits` would round down to 0.
   
   In such a case, would it be better to use `desiredBundleSize = 1`? This 
would result in lots of very small splits, but presumably for a runner doing 
over-splitting that's what they'd expect.
   
   Note: I'm not an expert here, just brainstorming. It's possible this 
scenario doesn't really exist, or that your proposed behavior would be more 
desirable.
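   
   To make the rounding concern concrete, here is a minimal sketch of the 
arithmetic. It is not Beam code: the class `BundleSizeSketch` and the helper 
`desiredBundleSize` are hypothetical names used only for illustration, and the 
clamp to 1 is just the alternative floated above, not the behavior in the PR. 
It assumes `desiredNumSplits` is positive.

```java
class BundleSizeSketch {
    /**
     * Hypothetical helper (not part of Beam). Returns 0 when the size is
     * unknown, otherwise the per-split size clamped to at least 1 byte.
     * Assumes desiredNumSplits > 0.
     */
    static long desiredBundleSize(long estimatedSizeBytes, int desiredNumSplits) {
        if (estimatedSizeBytes <= 0) {
            // The source could not estimate its size; leave the decision to the caller.
            return 0L;
        }
        // Integer division truncates toward zero, so a small estimate divided by a
        // large split count yields 0 even though the estimate itself is positive.
        return Math.max(1L, estimatedSizeBytes / desiredNumSplits);
    }

    public static void main(String[] args) {
        System.out.println(100L / 1000);                    // prints 0
        System.out.println(desiredBundleSize(100L, 1000));  // prints 1
        System.out.println(desiredBundleSize(0L, 1000));    // prints 0
    }
}
```

   With the clamp, a 100-byte estimate and 1000 requested splits gives a 
bundle size of 1 rather than 0, so the `desiredBundleSize <= 0` branch would 
only be reached when the estimate itself is unavailable.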
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 189576)
    Time Spent: 0.5h  (was: 20m)

> UnboundedReadFromBoundedSource.split() should always call split()
> -----------------------------------------------------------------
>
>                 Key: BEAM-4620
>                 URL: https://issues.apache.org/jira/browse/BEAM-4620
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Josh M
>            Assignee: Chamikara Jayalath
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If the source contains too little data (rounds the result down to 0), or if 
> size estimation fails, then we don't call .split().  We must always call 
> split(); instead of returning original source, we should have computed some 
> fallback value for desiredBundleSize.
> This bug has existed since the code was first introduced, so it is not a 
> regression - but it is certainly a bug.
>  
> source: 
> https://github.com/apache/beam/blob/697a1d17e473cd5b097aaaeee24c08f43cc77f58/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/UnboundedReadFromBoundedSource.java#L137



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
