[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19603


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147589361
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
+(
+  s"${genInputData.value}.size()",
--- End diff --

I see. got it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147589355
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,6 +591,9 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
+// `MapObjects` generates a while loop to traverse the elements of the 
input collection. We
+// need to take care of Seq and List because they may have O(n) 
complexity for indexed accessing
+// like `list.get(1)`. Here we use Iterator to travers Seq and List.
--- End diff --

nit: travers -> traverse


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147587911
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
+(
+  s"${genInputData.value}.size()",
--- End diff --

otherwise we need a re-sizable array to keep result, which is a lot of 
change and doesn't have a clear win(re-sizing is expensive).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147581601
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
+(
+  s"${genInputData.value}.size()",
--- End diff --

Is it OK with us for `size()` to take `O(n)` in some implementations of 
`Seq`?
Since this is out of the loop, it is not the worst, but it is not good. It 
would be better to iterate a loop using `hasNext()`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147581548
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
--- End diff --

I agreed. Currently the branches are a bit complicated.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147581528
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
--- End diff --

Oops, right, not reading this clearly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147581519
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
+(
+  s"${genInputData.value}.size()",
+  s"scala.collection.Iterator $it = 
${genInputData.value}.toIterator();",
--- End diff --

I think it is better to leave a comment for code readers.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147580091
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
+(
+  s"${genInputData.value}.size()",
+  s"scala.collection.Iterator $it = 
${genInputData.value}.toIterator();",
+  s"$it.next()"
+)
   case ObjectType(cls) if cls.isArray =>
-s"${genInputData.value}.length" -> 
s"${genInputData.value}[$loopIndex]"
+(
+  s"${genInputData.value}.length",
+  "",
+  s"${genInputData.value}[$loopIndex]"
+)
   case ObjectType(cls) if 
classOf[java.util.List[_]].isAssignableFrom(cls) =>
--- End diff --

The legal external typea for catalyst array are scala `Seq` and java 
`List`, so we don't need to match a more general type here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147580056
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
--- End diff --

That's not a big deal, we are traversing the collection from start to end, 
so either Iterator or accessing by index won't have much difference, may not 
worth to create 2 branches here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147579950
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
--- End diff --

you can use `seqObj.isInstanceOf[Seq[_]]`, but not a `Class`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147579883
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
--- End diff --

Also not sure if you want to change this here, but think this is more 
direct as `cls.isInstanceOf[Seq[_]]`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147579873
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
--- End diff --

This would cause some types of `Seq` to be traversed by iterator when they 
support constant-time random access. That's not so bad. But I think you could 
make a rule for `IndexedSeq` that takes precedence that retains the old 
behavior. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19603#discussion_r147579666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
 ---
@@ -591,18 +591,40 @@ case class MapObjects private(
   case _ => inputData.dataType
 }
 
-val (getLength, getLoopVar) = inputDataType match {
+val (getLength, prepareLoop, getLoopVar) = inputDataType match {
   case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
-s"${genInputData.value}.size()" -> 
s"${genInputData.value}.apply($loopIndex)"
+val it = ctx.freshName("it")
+(
+  s"${genInputData.value}.size()",
+  s"scala.collection.Iterator $it = 
${genInputData.value}.toIterator();",
+  s"$it.next()"
+)
   case ObjectType(cls) if cls.isArray =>
-s"${genInputData.value}.length" -> 
s"${genInputData.value}[$loopIndex]"
+(
+  s"${genInputData.value}.length",
+  "",
+  s"${genInputData.value}[$loopIndex]"
+)
   case ObjectType(cls) if 
classOf[java.util.List[_]].isAssignableFrom(cls) =>
--- End diff --

Not sure you want to change this here, or I could be missing something, but 
really any `java.util.Collection` has `iterator()` and `size()`.  It need not 
be restrict to `java.util.List`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...

2017-10-29 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19603

[SPARK-22385][SQL] MapObjects should not access list element by index

## What changes were proposed in this pull request?

This issue was discovered and investigated by Ohad Raviv and Sean Owen in 
https://issues.apache.org/jira/browse/SPARK-21657. The input data of 
`MapObjects` may be a `List` which has O(n) complexity for accessing by index. 
When converting input data to catalyst array, `MapObjects` gets element by 
index in each loop, and results to bad performance.

This PR fixes this issue by accessing elements via Iterator.

## How was this patch tested?

using the test script in https://issues.apache.org/jira/browse/SPARK-21657 
```
val BASE = 1
val N = 10
val df = sc.parallelize(List(("1234567890", (BASE to (BASE+N)).map(x => 
(x.toString, (x+1).toString, (x+2).toString, (x+3).toString)).toList 
))).toDF("c1", "c_arr")
spark.time(df.queryExecution.toRdd.foreach(_ => ()))
```

We can see 50x speed up.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark map-objects

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19603


commit 6cb4fca89e83172407114037d3a447ae6d941f0a
Author: Wenchen Fan 
Date:   2017-10-29T11:27:21Z

MapObjects should not access list element by index




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org