[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19603 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147589361 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") +( + s"${genInputData.value}.size()", --- End diff -- I see. got it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147589355 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,6 +591,9 @@ case class MapObjects private( case _ => inputData.dataType } +// `MapObjects` generates a while loop to traverse the elements of the input collection. We +// need to take care of Seq and List because they may have O(n) complexity for indexed accessing +// like `list.get(1)`. Here we use Iterator to travers Seq and List. --- End diff -- nit: travers -> traverse --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147587911 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") +( + s"${genInputData.value}.size()", --- End diff -- otherwise we need a re-sizable array to keep result, which is a lot of change and doesn't have a clear win(re-sizing is expensive). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147581601 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") +( + s"${genInputData.value}.size()", --- End diff -- Is it OK with us for `size()` to take `O(n)` in some implementations of `Seq`? Since this is out of the loop, it is not the worst, but it is not good. It would be better to iterate a loop using `hasNext()`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147581548 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") --- End diff -- I agreed. Currently the branches are a bit complicated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147581528 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => --- End diff -- Oops, right, not reading this clearly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147581519 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") +( + s"${genInputData.value}.size()", + s"scala.collection.Iterator $it = ${genInputData.value}.toIterator();", --- End diff -- I think it is better to leave a comment for code readers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147580091 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") +( + s"${genInputData.value}.size()", + s"scala.collection.Iterator $it = ${genInputData.value}.toIterator();", + s"$it.next()" +) case ObjectType(cls) if cls.isArray => -s"${genInputData.value}.length" -> s"${genInputData.value}[$loopIndex]" +( + s"${genInputData.value}.length", + "", + s"${genInputData.value}[$loopIndex]" +) case ObjectType(cls) if classOf[java.util.List[_]].isAssignableFrom(cls) => --- End diff -- The legal external typea for catalyst array are scala `Seq` and java `List`, so we don't need to match a more general type here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147580056 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") --- End diff -- That's not a big deal, we are traversing the collection from start to end, so either Iterator or accessing by index won't have much difference, may not worth to create 2 branches here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147579950 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => --- End diff -- you can use `seqObj.isInstanceOf[Seq[_]]`, but not a `Class` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147579883 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => --- End diff -- Also not sure if you want to change this here, but think this is more direct as `cls.isInstanceOf[Seq[_]]` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147579873 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") --- End diff -- This would cause some types of `Seq` to be traversed by iterator when they support constant-time random access. That's not so bad. But I think you could make a rule for `IndexedSeq` that takes precedence that retains the old behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19603#discussion_r147579666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala --- @@ -591,18 +591,40 @@ case class MapObjects private( case _ => inputData.dataType } -val (getLength, getLoopVar) = inputDataType match { +val (getLength, prepareLoop, getLoopVar) = inputDataType match { case ObjectType(cls) if classOf[Seq[_]].isAssignableFrom(cls) => -s"${genInputData.value}.size()" -> s"${genInputData.value}.apply($loopIndex)" +val it = ctx.freshName("it") +( + s"${genInputData.value}.size()", + s"scala.collection.Iterator $it = ${genInputData.value}.toIterator();", + s"$it.next()" +) case ObjectType(cls) if cls.isArray => -s"${genInputData.value}.length" -> s"${genInputData.value}[$loopIndex]" +( + s"${genInputData.value}.length", + "", + s"${genInputData.value}[$loopIndex]" +) case ObjectType(cls) if classOf[java.util.List[_]].isAssignableFrom(cls) => --- End diff -- Not sure you want to change this here, or I could be missing something, but really any `java.util.Collection` has `iterator()` and `size()`. It need not be restrict to `java.util.List`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19603: [SPARK-22385][SQL] MapObjects should not access l...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19603 [SPARK-22385][SQL] MapObjects should not access list element by index ## What changes were proposed in this pull request? This issue was discovered and investigated by Ohad Raviv and Sean Owen in https://issues.apache.org/jira/browse/SPARK-21657. The input data of `MapObjects` may be a `List` which has O(n) complexity for accessing by index. When converting input data to catalyst array, `MapObjects` gets element by index in each loop, and results to bad performance. This PR fixes this issue by accessing elements via Iterator. ## How was this patch tested? using the test script in https://issues.apache.org/jira/browse/SPARK-21657 ``` val BASE = 1 val N = 10 val df = sc.parallelize(List(("1234567890", (BASE to (BASE+N)).map(x => (x.toString, (x+1).toString, (x+2).toString, (x+3).toString)).toList ))).toDF("c1", "c_arr") spark.time(df.queryExecution.toRdd.foreach(_ => ())) ``` We can see 50x speed up. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark map-objects Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19603.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19603 commit 6cb4fca89e83172407114037d3a447ae6d941f0a Author: Wenchen FanDate: 2017-10-29T11:27:21Z MapObjects should not access list element by index --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org