[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115459#comment-15115459 ]
Stephen DiCocco commented on SPARK-12911:
-----------------------------------------

We have determined that one way to work around the issue is to add the array you want to search for as a literal column on the dataframe and then cache the frame. This causes the underlying type of both sides of the comparison to be UnsafeArrayData.

{code}
test("test array comparison") {
  val vectors: Vector[Row] = Vector(
    Row.fromTuple("id_1" -> Array(0L, 2L)),
    Row.fromTuple("id_2" -> Array(0L, 5L)),
    Row.fromTuple("id_3" -> Array(0L, 9L)),
    Row.fromTuple("id_4" -> Array(1L, 0L)),
    Row.fromTuple("id_5" -> Array(1L, 8L)),
    Row.fromTuple("id_6" -> Array(2L, 4L)),
    Row.fromTuple("id_7" -> Array(5L, 6L)),
    Row.fromTuple("id_8" -> Array(6L, 2L)),
    Row.fromTuple("id_9" -> Array(7L, 0L))
  )
  val data: RDD[Row] = sc.parallelize(vectors, 3)
  val schema = StructType(
    StructField("id", StringType, false) ::
      StructField("point", DataTypes.createArrayType(LongType, false), false) ::
      Nil
  )
  val sqlContext = new SQLContext(sc)
  val dataframe = sqlContext.createDataFrame(data, schema)
  val targetPoint: Array[Long] = Array(0L, 9L)

  // Adding the target column to the frame allows the comparison to succeed,
  // but there is definite overhead to doing this!
  val withTarget = dataframe.withColumn("target", array(targetPoint.map(value => lit(value)): _*))
  withTarget.cache()

  // Without the workaround, this is the line where it fails:
  //   java.util.NoSuchElementException: next on empty iterator
  // However, we know that there is a valid match.
  val targetRow = withTarget.where(withTarget("point") === withTarget("target")).first()
  assert(targetRow != null)
}
{code}

> Caching a dataframe causes array comparisons to fail (in filter / where) after 1.6
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-12911
>                 URL: https://issues.apache.org/jira/browse/SPARK-12911
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>            Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an array type, after 1.6 no valid comparisons are made if the dataframe has been cached. If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
>   val vectors: Vector[Row] = Vector(
>     Row.fromTuple("id_1" -> Array(0L, 2L)),
>     Row.fromTuple("id_2" -> Array(0L, 5L)),
>     Row.fromTuple("id_3" -> Array(0L, 9L)),
>     Row.fromTuple("id_4" -> Array(1L, 0L)),
>     Row.fromTuple("id_5" -> Array(1L, 8L)),
>     Row.fromTuple("id_6" -> Array(2L, 4L)),
>     Row.fromTuple("id_7" -> Array(5L, 6L)),
>     Row.fromTuple("id_8" -> Array(6L, 2L)),
>     Row.fromTuple("id_9" -> Array(7L, 0L))
>   )
>   val data: RDD[Row] = sc.parallelize(vectors, 3)
>   val schema = StructType(
>     StructField("id", StringType, false) ::
>       StructField("point", DataTypes.createArrayType(LongType, false), false) ::
>       Nil
>   )
>   val sqlContext = new SQLContext(sc)
>   val dataframe = sqlContext.createDataFrame(data, schema)
>   val targetPoint: Array[Long] = Array(0L, 9L)
>
>   // Caching is the trigger that causes the error (no caching, no error)
>   dataframe.cache()
>
>   // This is the line where it fails:
>   //   java.util.NoSuchElementException: next on empty iterator
>   // However, we know that there is a valid match.
>   val targetRow = dataframe.where(dataframe("point") ===
>     array(targetPoint.map(value => lit(value)): _*)).first()
>   assert(targetRow != null)
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
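The shape of the mismatch (the same logical array held in two internal representations that do not compare equal until both sides are normalized to one of them) can be illustrated loosely in plain Scala. This is only an analogy, not Spark internals: `Array` and `Vector` stand in for the two array representations.

```scala
// Loose analogy only: the same elements held in two different container types,
// standing in for the two internal array representations in the report.
val pointA: Array[Long] = Array(0L, 9L)
val pointV: Vector[Long] = Vector(0L, 9L)

// Comparing across representations fails even though the elements match.
assert(pointA != pointV)

// Normalizing both sides to one representation (as the workaround does by
// making both sides UnsafeArrayData via the literal column + cache) makes
// the comparison structural, and it succeeds.
assert(pointA.toSeq == pointV)
```

This mirrors why adding the target as a literal column and caching helps: both sides of the `===` then share one representation.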