[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20984

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r181339660

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,46 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+/**
+ * Implements an `IndexedSeq` interface for `ArrayData`. Notice that if the original `ArrayData`
+ * is a primitive array and contains null elements, it is better to ask for `IndexedSeq[Any]`,
+ * instead of `IndexedSeq[Int]`, in order to keep the null elements.
+ */
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private def getAccessor(dataType: DataType): (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
--- End diff --

I'd like to reuse the accessor getter from #20981, which covers `DateType` and `TimestampType`.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r181140913

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ArrayDataIndexedSeqSuite.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
+import org.apache.spark.sql.types.StringType
+import org.apache.spark.unsafe.types.UTF8String
+
+class ArrayDataIndexedSeqSuite extends SparkFunSuite {
+  def utf8(str: String): UTF8String = UTF8String.fromString(str)
+  val stringArray = Array("1", "10", "100", null)
+
+  private def testArrayData(arrayData: ArrayData): Unit = {
+    assert(arrayData.numElements == stringArray.length)
+    stringArray.zipWithIndex.map { case (e, i) =>
+      if (e != null) {
+        assert(arrayData.getUTF8String(i).toString().equals(e))
+      } else {
+        assert(arrayData.isNullAt(i))
+      }
+    }
+
+    val seq = arrayData.toSeq[UTF8String](StringType)
+    stringArray.zipWithIndex.map { case (e, i) =>
+      if (e != null) {
+        assert(seq(i).toString().equals(e))
+      } else {
+        assert(seq(i) == null)
+      }
+    }
+
+    intercept[IndexOutOfBoundsException] {
+      seq(-1)
+    }.getMessage().contains("must be between 0 and the length of the ArrayData.")
+
+    intercept[IndexOutOfBoundsException] {
+      seq(seq.length)
+    }.getMessage().contains("must be between 0 and the length of the ArrayData.")
+  }
+
+  test("ArrayDataIndexedSeq can work on GenericArrayData") {
+    val arrayData = new GenericArrayData(stringArray.map(utf8(_)))
+    testArrayData(arrayData)
+  }
+
+  test("ArrayDataIndexedSeq can work on UnsafeArrayData") {
+    val unsafeArrayData = ExpressionEncoder[Array[String]].resolveAndBind().
+      toRow(stringArray).getArray(0)
+    assert(unsafeArrayData.isInstanceOf[UnsafeArrayData])
+    testArrayData(unsafeArrayData)
--- End diff --

+1 for all type test case. :)
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r181140698

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,46 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+/**
+ * Implements an `IndexedSeq` interface for `ArrayData`. Notice that if the original `ArrayData`
+ * is a primitive array and contains null elements, it is better to ask for `IndexedSeq[Any]`,
+ * instead of `IndexedSeq[Int]`, in order to keep the null elements.
+ */
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private def getAccessor(dataType: DataType): (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
--- End diff --

`DateType` and `TimestampType`?
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180420612

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,46 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+/**
+ * Implements an `IndexedSeq` interface for `ArrayData`. Notice that if the original `ArrayData`
+ * is a primitive array and contains null elements, it is better to ask for `IndexedSeq[Any]`,
+ * instead of `IndexedSeq[Int]`, in order to keep the null elements.
+ */
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private def getAccessor(dataType: DataType): (Int) => Any = dataType match {
--- End diff --

Ok. I also want to reuse the accessor getter in #20981.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180364984

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,46 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+/**
+ * Implements an `IndexedSeq` interface for `ArrayData`. Notice that if the original `ArrayData`
+ * is a primitive array and contains null elements, it is better to ask for `IndexedSeq[Any]`,
+ * instead of `IndexedSeq[Int]`, in order to keep the null elements.
+ */
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private def getAccessor(dataType: DataType): (Int) => Any = dataType match {
--- End diff --

Suggestion for a small follow-up: we could also use the `accessor` to improve the `foreach` construct.
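The idea behind the reviewer's suggestion — resolve the per-type element accessor once when the wrapper is built, then reuse that function for both random access and traversal instead of pattern matching on the type for every element — can be sketched outside Spark. Everything below (`Store`, `ElemType`, `foreachElem`) is a hypothetical stand-in for illustration, not Spark's actual API:

```scala
object AccessorSketch {
  sealed trait ElemType
  case object IntElem extends ElemType
  case object StringElem extends ElemType

  // A toy typed store: one backing array per element type.
  class Store(ints: Array[Int], strs: Array[String], tpe: ElemType) {
    def numElements(): Int = tpe match {
      case IntElem    => ints.length
      case StringElem => strs.length
    }

    // The type dispatch happens exactly once, at construction time;
    // every element access afterwards is a plain function call.
    private val accessor: Int => Any = tpe match {
      case IntElem    => (idx: Int) => ints(idx)
      case StringElem => (idx: Int) => strs(idx)
    }

    def apply(idx: Int): Any = accessor(idx)

    // Traversal reuses the same resolved accessor instead of re-matching.
    def foreachElem(f: Any => Unit): Unit = {
      var i = 0
      val n = numElements()
      while (i < n) { f(accessor(i)); i += 1 }
    }
  }
}
```

With `val s = new AccessorSketch.Store(Array(1, 2, 3), Array.empty, AccessorSketch.IntElem)`, both `s(1)` and `s.foreachElem(...)` go through the single pre-resolved accessor.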
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180361913

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,40 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private def getAccessor(dataType: DataType): (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
+    case LongType => (idx: Int) => arrayData.getLong(idx)
+    case FloatType => (idx: Int) => arrayData.getFloat(idx)
+    case DoubleType => (idx: Int) => arrayData.getDouble(idx)
+    case d: DecimalType => (idx: Int) => arrayData.getDecimal(idx, d.precision, d.scale)
+    case CalendarIntervalType => (idx: Int) => arrayData.getInterval(idx)
+    case StringType => (idx: Int) => arrayData.getUTF8String(idx)
+    case BinaryType => (idx: Int) => arrayData.getBinary(idx)
+    case s: StructType => (idx: Int) => arrayData.getStruct(idx, s.length)
+    case _: ArrayType => (idx: Int) => arrayData.getArray(idx)
+    case _: MapType => (idx: Int) => arrayData.getMap(idx)
+    case u: UserDefinedType[_] => getAccessor(u.sqlType)
+    case _ => (idx: Int) => arrayData.get(idx, dataType)
+  }
+
+  private val accessor: (Int) => Any = getAccessor(dataType)
+
+  override def apply(idx: Int): T = if (0 <= idx && idx < arrayData.numElements()) {
--- End diff --

Ok.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180360146

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,40 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private def getAccessor(dataType: DataType): (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
+    case LongType => (idx: Int) => arrayData.getLong(idx)
+    case FloatType => (idx: Int) => arrayData.getFloat(idx)
+    case DoubleType => (idx: Int) => arrayData.getDouble(idx)
+    case d: DecimalType => (idx: Int) => arrayData.getDecimal(idx, d.precision, d.scale)
+    case CalendarIntervalType => (idx: Int) => arrayData.getInterval(idx)
+    case StringType => (idx: Int) => arrayData.getUTF8String(idx)
+    case BinaryType => (idx: Int) => arrayData.getBinary(idx)
+    case s: StructType => (idx: Int) => arrayData.getStruct(idx, s.length)
+    case _: ArrayType => (idx: Int) => arrayData.getArray(idx)
+    case _: MapType => (idx: Int) => arrayData.getMap(idx)
+    case u: UserDefinedType[_] => getAccessor(u.sqlType)
+    case _ => (idx: Int) => arrayData.get(idx, dataType)
+  }
+
+  private val accessor: (Int) => Any = getAccessor(dataType)
+
+  override def apply(idx: Int): T = if (0 <= idx && idx < arrayData.numElements()) {
--- End diff --

NITish: Can you put the if statement on a separate line? This is kinda hard to read.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180358272

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ArrayDataIndexedSeqSuite.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
+import org.apache.spark.sql.types.StringType
+import org.apache.spark.unsafe.types.UTF8String
+
+class ArrayDataIndexedSeqSuite extends SparkFunSuite {
+  def utf8(str: String): UTF8String = UTF8String.fromString(str)
+  val stringArray = Array("1", "10", "100", null)
+
+  private def testArrayData(arrayData: ArrayData): Unit = {
+    assert(arrayData.numElements == stringArray.length)
+    stringArray.zipWithIndex.map { case (e, i) =>
+      if (e != null) {
+        assert(arrayData.getUTF8String(i).toString().equals(e))
+      } else {
+        assert(arrayData.isNullAt(i))
+      }
+    }
+
+    val seq = arrayData.toSeq[UTF8String](StringType)
+    stringArray.zipWithIndex.map { case (e, i) =>
+      if (e != null) {
+        assert(seq(i).toString().equals(e))
+      } else {
+        assert(seq(i) == null)
+      }
+    }
+
+    intercept[IndexOutOfBoundsException] {
+      seq(-1)
+    }.getMessage().contains("must be between 0 and the length of the ArrayData.")
+
+    intercept[IndexOutOfBoundsException] {
+      seq(seq.length)
+    }.getMessage().contains("must be between 0 and the length of the ArrayData.")
+  }
+
+  test("ArrayDataIndexedSeq can work on GenericArrayData") {
+    val arrayData = new GenericArrayData(stringArray.map(utf8(_)))
+    testArrayData(arrayData)
+  }
+
+  test("ArrayDataIndexedSeq can work on UnsafeArrayData") {
+    val unsafeArrayData = ExpressionEncoder[Array[String]].resolveAndBind().
+      toRow(stringArray).getArray(0)
+    assert(unsafeArrayData.isInstanceOf[UnsafeArrayData])
+    testArrayData(unsafeArrayData)
--- End diff --

This test may need polish to test against all possible types.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180357829

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,37 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private lazy val accessor: (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
+    case LongType => (idx: Int) => arrayData.getLong(idx)
+    case FloatType => (idx: Int) => arrayData.getFloat(idx)
+    case DoubleType => (idx: Int) => arrayData.getDouble(idx)
+    case d: DecimalType => (idx: Int) => arrayData.getDecimal(idx, d.precision, d.scale)
+    case CalendarIntervalType => (idx: Int) => arrayData.getInterval(idx)
+    case StringType => (idx: Int) => arrayData.getUTF8String(idx)
+    case BinaryType => (idx: Int) => arrayData.getBinary(idx)
+    case s: StructType => (idx: Int) => arrayData.getStruct(idx, s.length)
+    case _: ArrayType => (idx: Int) => arrayData.getArray(idx)
+    case _: MapType => (idx: Int) => arrayData.getMap(idx)
+    case _ => (idx: Int) => arrayData.get(idx, dataType)
+  }
+
+  override def apply(idx: Int): T = if (0 <= idx && idx < arrayData.numElements()) {
+    if (arrayData.isNullAt(idx)) {
+      null.asInstanceOf[T]
--- End diff --

For primitive ArrayData, this seems to be an issue for null elements...
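The null-element concern above can be reproduced without any Spark code: in Scala, casting `null` to a primitive type unboxes it to that type's default value, so `null.asInstanceOf[T]` in a wrapper typed `IndexedSeq[Int]` silently turns a null slot into `0`, while `IndexedSeq[Any]` keeps the `null`. A minimal demonstration:

```scala
object NullUnboxSketch {
  def main(args: Array[String]): Unit = {
    // Unboxing null to a primitive yields the type's default value, not null.
    val asInt: Int = null.asInstanceOf[Int]
    // Casting null to a reference-like type preserves it.
    val asAny: Any = null.asInstanceOf[Any]
    println(asInt) // 0
    println(asAny) // null
  }
}
```

This is exactly why the class doc added later in the review recommends asking for `IndexedSeq[Any]` rather than `IndexedSeq[Int]` when the underlying primitive array can contain nulls.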
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r180292435

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,32 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private lazy val accessor: (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
+    case LongType => (idx: Int) => arrayData.getLong(idx)
+    case FloatType => (idx: Int) => arrayData.getFloat(idx)
+    case DoubleType => (idx: Int) => arrayData.getDouble(idx)
+    case d: DecimalType => (idx: Int) => arrayData.getDecimal(idx, d.precision, d.scale)
+    case CalendarIntervalType => (idx: Int) => arrayData.getInterval(idx)
+    case StringType => (idx: Int) => arrayData.getUTF8String(idx)
+    case BinaryType => (idx: Int) => arrayData.getBinary(idx)
+    case s: StructType => (idx: Int) => arrayData.getStruct(idx, s.length)
+    case _: ArrayType => (idx: Int) => arrayData.getArray(idx)
+    case _: MapType => (idx: Int) => arrayData.getMap(idx)
+    case _ => (idx: Int) => arrayData.get(idx, dataType)
+  }
+
+  override def apply(idx: Int): T = if (idx < arrayData.numElements()) {
--- End diff --

Thanks. Added the check now.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20984#discussion_r179920210

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala ---
@@ -164,3 +167,32 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
     }
   }
 }
+
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T] {
+
+  private lazy val accessor: (Int) => Any = dataType match {
+    case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+    case ByteType => (idx: Int) => arrayData.getByte(idx)
+    case ShortType => (idx: Int) => arrayData.getShort(idx)
+    case IntegerType => (idx: Int) => arrayData.getInt(idx)
+    case LongType => (idx: Int) => arrayData.getLong(idx)
+    case FloatType => (idx: Int) => arrayData.getFloat(idx)
+    case DoubleType => (idx: Int) => arrayData.getDouble(idx)
+    case d: DecimalType => (idx: Int) => arrayData.getDecimal(idx, d.precision, d.scale)
+    case CalendarIntervalType => (idx: Int) => arrayData.getInterval(idx)
+    case StringType => (idx: Int) => arrayData.getUTF8String(idx)
+    case BinaryType => (idx: Int) => arrayData.getBinary(idx)
+    case s: StructType => (idx: Int) => arrayData.getStruct(idx, s.length)
+    case _: ArrayType => (idx: Int) => arrayData.getArray(idx)
+    case _: MapType => (idx: Int) => arrayData.getMap(idx)
+    case _ => (idx: Int) => arrayData.get(idx, dataType)
+  }
+
+  override def apply(idx: Int): T = if (idx < arrayData.numElements()) {
--- End diff --

Do we need a check for `0 <= idx`, too? If so, it would be good to update the message in the exception.
[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/20984

[SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData

## What changes were proposed in this pull request?

We don't have a good way to sequentially access `UnsafeArrayData` with a common interface such as `Seq`. An example is `MapObject`, where we need to access several sequence collection types together. But `UnsafeArrayData` doesn't implement `ArrayData.array`, and calling `toArray` will copy the entire array. We can provide an `IndexedSeq` wrapper for `ArrayData`, so we can avoid copying the entire array.

## How was this patch tested?

TBD.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23875

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20984.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20984

commit 27eddd764bc60f9f41156c3548e98e6aea68fd02
Author: Liang-Chi Hsieh
Date: 2018-04-05T14:33:39Z

    Add IndexedSeq wrapper for ArrayData.
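The core idea of the patch — an `IndexedSeq` view over array data that delegates each element access instead of copying the whole array up front — can be illustrated with a self-contained sketch. `SimpleArrayData` below is a hypothetical stand-in for Spark's `ArrayData` (the per-type accessor machinery is elided), and the bounds-check message mirrors the one the review discussion converged on:

```scala
// Hypothetical, simplified stand-in for ArrayData; not Spark's actual class.
class SimpleArrayData(values: Array[Any]) {
  def numElements(): Int = values.length
  def isNullAt(idx: Int): Boolean = values(idx) == null
  def get(idx: Int): Any = values(idx)
}

// IndexedSeq view over the data: O(1) apply that delegates to the
// underlying storage, so no element is copied when the view is created.
class SimpleArrayDataIndexedSeq[T](arrayData: SimpleArrayData) extends IndexedSeq[T] {

  override def length: Int = arrayData.numElements()

  override def apply(idx: Int): T =
    if (0 <= idx && idx < arrayData.numElements()) {
      if (arrayData.isNullAt(idx)) {
        null.asInstanceOf[T]
      } else {
        arrayData.get(idx).asInstanceOf[T]
      }
    } else {
      throw new IndexOutOfBoundsException(
        s"Index $idx must be between 0 and the length of the ArrayData.")
    }
}
```

Wrapping an array of boxed strings then gives the whole `IndexedSeq` API (`seq(0)`, `seq.exists(_ == null)`, `seq.zipWithIndex`, ...) without a copy, while `seq(-1)` and `seq(seq.length)` throw `IndexOutOfBoundsException` as the suite above expects.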