RE: How this unit test passed on master trunk?

2016-04-24 Thread Yong Zhang
So in that case the result will be the following:

[1,[1,1]]
[3,[3,1]]
[2,[2,1]]

Thanks for explaining the meaning of it. But the question is: how can first() be [3,[3,1]]? In fact, if there were any ordering in the final result, it would be [1,[1,1]] instead of [3,[3,1]], correct?
Yong

Re: How this unit test passed on master trunk?

2016-04-23 Thread Zhan Zhang
There are multiple records in the DF:

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).show
+---+-----------------------------+
|  a|min(struct(unresolvedstar()))|
+---+-----------------------------+
|  1|                        [1,1]|
|  3|                        [3,1]|
|  2|                        [2,1]|
+---+-----------------------------+

The meaning of .groupBy($"a").agg(min(struct($"record.*"))) is to get the min for all the records with the same $"a".

For example, given TestData2(1,1) :: TestData2(1,2), the result would be (1, (1, 1)), since struct(1, 1) is less than struct(1, 2). Please check how the ordering is implemented in InterpretedOrdering.
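
As a rough sketch of that field-by-field comparison, plain Scala tuple ordering follows the same lexicographic rule (an illustration only, not the actual InterpretedOrdering code):

scala> Seq((1, 1), (1, 2)).min
res1: (Int, Int) = (1,1)

The first fields tie, so the second fields decide, and (1, 1) wins.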

The output itself does not have any ordering. I am not sure why the unit test and the real environment behave differently.
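
If a deterministic first row is wanted, one option (a sketch of my own, not what the test suite does; the "m" alias only shortens the generated column name) is to sort before calling first():

scala> structDF.groupBy($"a").agg(min(struct($"record.*")).as("m")).orderBy($"a").first()
res2: org.apache.spark.sql.Row = [1,[1,1]]

With the orderBy, the smallest group key comes first regardless of partitioning, so first() always returns [1,[1,1]].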

Xiao,

I do see the difference between unit test and local cluster run. Do you know 
the reason?

Thanks.

Zhan Zhang




Re: How this unit test passed on master trunk?

2016-04-22 Thread Ted Yu
This was added by Xiao through:

[SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error
Handling when DataFrame/DataSet Functions using Star

I tried in spark-shell and got:

scala> val first = structDf.groupBy($"a").agg(min(struct($"record.*"))).first()
first: org.apache.spark.sql.Row = [1,[1,1]]

BTW
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/715/consoleFull
shows this test passing.
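
For context, the star expansion added by SPARK-13320 is what lets $"record.*" be used inside struct() and array(). A minimal sketch against the thread's structDF (the "s" alias is my own addition, and the row order of show may vary with partitioning):

scala> structDF.select(struct($"record.*").as("s")).show
+-----+
|    s|
+-----+
|[1,1]|
|[1,2]|
|[2,1]|
|[2,2]|
|[3,1]|
|[3,2]|
+-----+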

How this unit test passed on master trunk?

2016-04-22 Thread Yong Zhang
Hi,
I was trying to find out why this unit test can pass in Spark code.

in
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

for this unit test:
  test("Star Expansion - CreateStruct and CreateArray") {
val structDf = testData2.select("a", "b").as("record")
// CreateStruct and CreateArray in aggregateExpressions
assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == 
Row(3, Row(3, 1)))
assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == 
Row(3, Seq(3, 1)))

// CreateStruct and CreateArray in project list (unresolved alias)
assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === 
Seq(1, 1))

// CreateStruct and CreateArray in project list (alias)
assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 
1)))

assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) 
=== Seq(1, 1))
  }From my understanding, the data return in this case should be Row(1, Row(1, 
1]), as that will be min of struct.In fact, if I run the spark-shell on my 
laptop, and I got the result I expected:
./bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class TestData2(a: Int, b: Int)
defined class TestData2

scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: TestData2(3,2) :: Nil, 2).toDF()

scala> val structDF = testData2DF.select("a","b").as("record")

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
res0: org.apache.spark.sql.Row = [1,[1,1]]

scala> structDF.show
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  1|
|  2|  2|
|  3|  1|
|  3|  2|
+---+---+

So from my Spark, which I built from master, I cannot get Row(3, Row(3, 1)) back in this case. Why does the unit test assert that Row(3, Row(3, 1)) should be the first row, and still pass? And why can I not reproduce that in my spark-shell? I am trying to understand how to interpret the meaning of "agg(min(struct($"record.*")))".

Thanks
Yong