[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add .fillna({ 'numOnesBefore': 0 }), although there are no visible effect 
on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })`, although there are no visible effect 
on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })` -- although there are no visible 
effect on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add .fillna({ 'numOnesBefore': 0 }), although there are no visible effect 
on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 })` {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })` -- although there are no visible 
effect on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOI

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 })` {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how two rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-+-+--+++--+--+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-+-+--+++--+--+
|1|1| 1|   0|   1| 0| 1|
|2|2| 0|   0|   1| 0| 1|
|1|3| 1|   0|   2| 0| 2|
+-+-+--+++--+--+
{code}

with

{code}
+--+
|sessId|
+--+
| 1|
| 2|
+--+
{code}

The result is

{code}
+--+-+-+--+++--+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+--+-+-+--+++--+
| 1|2|2| 0|   0|   1| 0|
| 2|1|1| 1|   0|   1|-1|
| 2|1|3| 1|   0|   2| 0|
+--+-+-+--+++--+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- 

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2019-05-13 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18473:
---
Labels: correctness  (was: )

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>Priority: Major
>  Labels: correctness
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org