[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051013#comment-16051013
 ] 

Reynold Xin commented on SPARK-1:
-

But this ticket has nothing to do with SQL?

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050776#comment-16050776
 ] 

Xiao Li commented on SPARK-1:
-

This function is still missing in the SQL interface.

We can achieve the resolution by names by using the CORRESPONDING BY clause. 
For example,

{noformat}
(select * from t1) union corresponding by (c1, c2) (select * from t2);
{noformat}


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050730#comment-16050730
 ] 

Reynold Xin commented on SPARK-1:
-

What's left in this ticket? Didn't we fix it already? If it is about by name 
resolution that belongs to a separate ticket. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-06-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048477#comment-16048477
 ] 

Joseph K. Bradley commented on SPARK-1:
---

Thanks for explaining!  I just rediscovered this issue (which I'd forgotten): 
[SPARK-21043]

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-03-23 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939347#comment-15939347
 ] 

Xiao Li commented on SPARK-1:
-

[~josephkb] In the SQL specification, the set operations are merging columns by 
column positions. However, it also provides another keyword {{CORRESPONDING}}. 
If users specify that keyword with set operations, we will merge them by column 
names, regardless of their positions. Do you want this feature? I can do it in 
the next release, since Spark 2.2 will be released soon. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-03-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939323#comment-15939323
 ] 

Joseph K. Bradley commented on SPARK-1:
---

[~smilegator] I wouldn't call that result "right."  That's definitely a bug 
unless the SQL specification is really messed up.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-26 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170231#comment-15170231
 ] 

Xiao Li commented on SPARK-1:
-

[~josephkb] The result is right. unionall does not consider the column names.  

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170146#comment-15170146
 ] 

Joseph K. Bradley commented on SPARK-1:
---

[~rxin] Is this the same issue as the following?  This is in the current master:
{code}
val e1 = sqlContext.createDataFrame(List(
  ("a", "b"),
  ("b", "c"),
  ("c", "d")
)).toDF("src", "dst")
val e2 = e1.select(e1("src").as("dst"), e1("dst").as("src"))
val e3 = e1.unionAll(e2)
e3.show()

+---+---+
|src|dst|
+---+---+
|  a|  b|
|  b|  c|
|  c|  d|
|  a|  b|
|  b|  c|
|  c|  d|
+---+---+
{code}


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155229#comment-15155229
 ] 

Xiao Li commented on SPARK-1:
-

Thanks! 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155047#comment-15155047
 ] 

Reynold Xin commented on SPARK-1:
-

OK I think this is going to be really difficult to fix right now. However, once 
we refactor the sql internals and introduce the concept of a local plan tree, 
then we might be able to just hash the local plan tree and use that as the 
seed. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151746#comment-15151746
 ] 

Xiao Li commented on SPARK-1:
-

Yeah, you are right. This part is an issue. That is why I did not remove the 
existing rand/randn function with a seed argument. I am unable to find a 
solution for controlling data partitioning and task scheduling when users call 
rand/randn functions with a specific seed number.  

At least, we have to let users realize these results are not deterministic. 
Actually, I think this is also an issue in many of our test cases. Many test 
cases are using seeds with an assumption that rand/rand(seed) can provide a 
deterministic result. 


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151720#comment-15151720
 ] 

Liang-Chi Hsieh commented on SPARK-1:
-

Yes. I agree that when user provides a specific seed number, the result should 
be deterministic. The problem is when you have many data partitions, you will 
have same random number sequences in each partition.

E.g, you have a table with 1000 rows evenly distributed in 5 partitions. If you 
do something required with random number involved for each row, your 201 ~ 400 
rows will be processed with same random numbers as the first 200 rows. Then 
your computation will not be random as it requires at the beginning. Your each 
partition will share a same pattern.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151265#comment-15151265
 ] 

Xiao Li commented on SPARK-1:
-

Another example is MS SQL Server Rand()

https://msdn.microsoft.com/en-us/library/ms177610.aspx

{code}
seed
Is an integer expression (tinyint, smallint, or int) that gives the seed value. 
If seed is not specified, the SQL Server Database Engine assigns a seed value 
at random. For a specified seed value, the result returned is always the same.
{code}

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150550#comment-15150550
 ] 

Xiao Li commented on SPARK-1:
-

For example, in the JAVA document, 
https://docs.oracle.com/javase/7/docs/api/java/util/Random.html, they clearly 
explain the behavior when users specify the seed.

{code}
If two instances of Random are created with the same seed, and the same 
sequence of method calls is made for each, they will generate and return 
identical sequences of numbers.
{code}

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150541#comment-15150541
 ] 

Xiao Li commented on SPARK-1:
-

When users provide a specific seed number, they always expect a deterministic 
result. However, the current implementation still returns a non-deterministic 
result. If so, it does not make sense to provide users such an API with seed 
number. This will make users confused. Right?

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150072#comment-15150072
 ] 

Liang-Chi Hsieh commented on SPARK-1:
-

Isn't weird? Suppose each partition should have different random numbers.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150053#comment-15150053
 ] 

Xiao Li commented on SPARK-1:
-

Yeah, the same series of random number. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150050#comment-15150050
 ] 

Liang-Chi Hsieh commented on SPARK-1:
-

But when you set deterministic to true, your each data partition will get same 
random values, right?

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150042#comment-15150042
 ] 

Xiao Li commented on SPARK-1:
-

Yeah. I realized it when fixing this problem. Thus, in the PR, I just added 
another parameter `deterministic` for rand and randn. If necessary, users can 
set `deterministic` to true.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150008#comment-15150008
 ] 

Liang-Chi Hsieh commented on SPARK-1:
-

If you don't attach a partition id, wouldn't your each data partition have same 
random number?

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149850#comment-15149850
 ] 

Apache Spark commented on SPARK-1:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11232

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149507#comment-15149507
 ] 

Xiao Li commented on SPARK-1:
-

Will try to submit a PR tonight. When users specify a seed, I am unable to find 
a reason why we need to add partition id into the seed value.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149155#comment-15149155
 ] 

Xiao Li commented on SPARK-1:
-

[~josephkb] I found the root cause. : ) In the genCode of Randn and Rand, the 
seed is user-provided. However, the partitionID could be different. 
{code}
  s"$rngTerm = new $className(${seed}L + 
org.apache.spark.TaskContext.getPartitionId());")
{code}

If you remove that, you will get the right answer. 

{code}
  s"$rngTerm = new $className(${seed}L);")
{code}


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149053#comment-15149053
 ] 

Joseph K. Bradley commented on SPARK-1:
---

I now have a much more complex example which does not use unionAll.  But it's 
still an issue with randn, so I suspect it's the same bug.  If needed, I can 
push a branch, but it's a mess of code.

[~smilegator] Thanks for taking a look.  I'll keep watching the JIRA!

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148212#comment-15148212
 ] 

Xiao Li commented on SPARK-1:
-

Tried join, intersect and except in 2.0. Works fine! For example, 
{code}
val df4 = df1.join(df2)
df4.explain(true)
println("DF4")
df4.show()
{code}

The problem exists only if we use {unionall}. : ) 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148202#comment-15148202
 ] 

Xiao Li commented on SPARK-1:
-

Interesting. This query has specified the seed. Thus, it should return the same 
result. 

{code}
== Physical Plan ==
Union
:- WholeStageCodegen
:  :  +- Project [_1#0 AS id#1,randn(12345) AS b#2]
:  : +- Filter (_1#0 = 0)
:  :+- INPUT
:  +- LocalTableScan [_1#0], [[0],[1]]
+- WholeStageCodegen
   :  +- Project [_1#0 AS id#1,randn(12345) AS b#2]
   : +- Filter (_1#0 = 0)
   :+- INPUT
   +- LocalTableScan [_1#0], [[0],[1]]
{code}

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148187#comment-15148187
 ] 

Xiao Li commented on SPARK-1:
-

The current solution also has performance penalty. That has been mentioned in 
another JIRA: https://issues.apache.org/jira/browse/SPARK-2183

In the current logical plan, we lost the actual data sources for each member 
after deduplication. This is another issue. 

Let me at [~rxin] [~marmbrus] [~yhuai]

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148186#comment-15148186
 ] 

Xiao Li commented on SPARK-1:
-

You will get the right result if you cache the first DF
{code}
// Removing the following filter() call makes this give the expected result.
val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)).cache()
println("DF1")
df1.show()

val df2 = df1.select("id", "b")
println("DF2")
df2.show()  // same as df1.show(), as expected

val df3 = df1.unionAll(df2)
println("DF3")
df3.show()  // NOT two copies of df1, which is unexpected
{code}

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148182#comment-15148182
 ] 

Xiao Li commented on SPARK-1:
-

This is a known issue. The same issue exists in CTE with non-deterministic 
expression. 

For example, the following query could return a wrong result. 

{code}
With q as (select * from testData limit 10) select * from q as q1 inner join q 
as q2 where q1.key = q2.key
{code}

We need to materialize it before doing self unionall/join/intersect like 
operations. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148163#comment-15148163
 ] 

Xiao Li commented on SPARK-1:
-

Glad to work on this issue. Let me try it. Will keep you posted. Thanks!

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148135#comment-15148135
 ] 

Joseph K. Bradley commented on SPARK-1:
---

I haven't tested with 1.5 yet, but I assume it affects 1.5 as well.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org