[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15561895#comment-15561895 ]
Ayan guha commented on SPARK-17845: ----------------------------------- IMHO, Window.rowsBetween() as a function name looks fine. What i would propose would be: Window.rowsBetween(startFrom=UNBOUNDED,endTo=CURRENT_ROW,preceeding=0,following=0) startFrom, endTo: Determining range preceeding,following: Anchor of current row and thus altering the range. Calls: //ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING Window.rowsBetween(preceeding=3,following=3) //ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING Window.rowsBetween(startFrom=UNBOUNDED,preceeding=3) //ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW Window.rowsBetween(startFrom=UNBOUNDED,endTo=CURRENT_ROW) //ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING (Not PRECEDING) Window.rowsBetween(startFrom=CURRENT_ROW,endTo=UNBOUNDED) //ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING Window.rowsBetween(startFrom=UNBOUNDED,endTo=UNBOUNDED) One missing scenario (I think that is a valid one) //ROWS BETWEEN 3 FOLLOWING AND UNBOUNDED FOLLOWING (pair of (2)) Window.rowsBetween(endTo=UNBOUNDED,following=3) This will be closer to SQL options, IMHO. > Improve window function frame boundary API in DataFrame > ------------------------------------------------------- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Reynold Xin > Assignee: Reynold Xin > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue to indicate "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird Long.MinValue or Long.MaxValue has some special meaning. > 3. Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. A variant of option 2 is to require explicitly specification > of begin/end even in the case of unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org