[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peay updated SPARK-18473:
-------------------------
    Description:

I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead some values are shuffled around, and some are just plain wrong. This can be reproduced by joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how two rows have a sessId of 2 (instead of one row, as expected), and how `fiftyCount` is now negative in one row even though it is always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses `window.rowsBetween(-10, 0)`.
- `hasFifty` uses `window.rowsBetween(-10, -1)`.

The result is *correct* if any one of the following changes is made:
- `hasFifty` is computed over `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })` after the other call to `fillna`, although as far as I can tell this has no visible effect on the dataframe as shown by `show`.
- I use a LEFT OUTER join instead of an INNER join.
- I write both dataframes to Parquet, read them back, and join those.
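The frame semantics behind the `fillna` call in the repro can be illustrated outside Spark. The following is only a sketch using pandas (not Spark itself): a `rowsBetween(-10, -1)` frame is empty for the first row of the partition, and SQL's SUM over zero rows is null, which is why `numFiftyStrictlyBefore` needs filling.

```python
import math

import pandas as pd

# hasFifty is 0 for all three rows of the repro (no index equals 50).
has_fifty = pd.Series([0, 0, 0])

# Frame (-10, -1): sum over up to 10 rows strictly before the current row.
# Emulated here by shifting the series by one and taking a rolling sum.
# The first row's frame is empty, which Spark's SUM reports as null --
# hence the fillna({'numFiftyStrictlyBefore': 0}) in the repro.
strictly_before = has_fifty.shift(1).rolling(window=10, min_periods=1).sum()

assert math.isnan(strictly_before.iloc[0])  # empty frame -> null/NaN
print(strictly_before.tolist())
```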
This can be reproduced in pyspark using:

{code:python}
import pandas as pd  # used by createDataFrame below; missing from the original snippet
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
    pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
    df1
    .withColumn("hasOne", (col("index") == 1).cast("int"))
    .withColumn("hasFifty", (col("index") == 50).cast("int"))
    .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
    .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
    .fillna({'numFiftyStrictlyBefore': 0})
    .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}
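Independently of Spark, the expected identity behavior of the join can be cross-checked with pandas, which the repro already imports. This is only a sketch of the intended semantics, with the column values transcribed from the report's first table; it does not exercise the Spark bug.

```python
import pandas as pd

# df2's visible columns, transcribed from the first table in the report.
df2 = pd.DataFrame({
    "index":      [1, 2, 1],
    "timeStamp":  [1, 2, 3],
    "hasOne":     [1, 0, 1],
    "hasFifty":   [0, 0, 0],
    "oneCount":   [1, 1, 2],
    "fiftyCount": [0, 0, 0],
    "sessId":     [1, 1, 2],
})
df_selector = pd.DataFrame({"sessId": [1, 2]})

# The selector covers every sessId value present in df2, so an inner join
# on sessId should act as the identity: the same three rows come back,
# with no altered values and fiftyCount still zero everywhere.
df_joined = df_selector.merge(df2, on="sessId", how="inner")
print(df_joined)
```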
[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions
[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-18473:
-------------------------------
    Labels: correctness  (was: )

> Correctness issue in INNER join result with window functions
> ------------------------------------------------------------
>
>                 Key: SPARK-18473
>                 URL: https://issues.apache.org/jira/browse/SPARK-18473
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.0.1
>            Reporter: peay
>            Assignee: Xiao Li
>          Priority: Major
>            Labels: correctness

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)