[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6

Andrew Ray (JIRA) Wed, 20 Jan 2016 11:49:03 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109269#comment-15109269
 ]


Andrew Ray commented on SPARK-12911:
------------------------------------

In the current master this happens even without caching. The generated code 
calls UnsafeArrayData.equals for each row with an argument of type 
GenericArrayData (the filter value) which returns false. Presumably the 
generated code needs to convert the filter value to an UnsafeArrayData but I'm 
at a loss for where to do so.

[~sdicocco] This is a bug, not something a developer should have to worry about.

> Cacheing a dataframe causes array comparisons to fail (in filter / where) 
> after 1.6
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-12911
>                 URL: https://issues.apache.org/jira/browse/SPARK-12911
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>            Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an 
> array type, after 1.6 no valid comparisons are made if the dataframe has been 
> cached.  If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
>     val vectors: Vector[Row] =  Vector(
>       Row.fromTuple("id_1" -> Array(0L, 2L)),
>       Row.fromTuple("id_2" -> Array(0L, 5L)),
>       Row.fromTuple("id_3" -> Array(0L, 9L)),
>       Row.fromTuple("id_4" -> Array(1L, 0L)),
>       Row.fromTuple("id_5" -> Array(1L, 8L)),
>       Row.fromTuple("id_6" -> Array(2L, 4L)),
>       Row.fromTuple("id_7" -> Array(5L, 6L)),
>       Row.fromTuple("id_8" -> Array(6L, 2L)),
>       Row.fromTuple("id_9" -> Array(7L, 0L))
>     )
>     val data: RDD[Row] = sc.parallelize(vectors, 3)
>     val schema = StructType(
>       StructField("id", StringType, false) ::
>         StructField("point", DataTypes.createArrayType(LongType, false), 
> false) ::
>         Nil
>     )
>     val sqlContext = new SQLContext(sc)
>     val dataframe = sqlContext.createDataFrame(data, schema)
>     val targetPoint:Array[Long] = Array(0L,9L)
>     //Cacheing is the trigger to cause the error (no cacheing causes no error)
>     dataframe.cache()
>     //This is the line where it fails
>     //java.util.NoSuchElementException: next on empty iterator
>     //However we know that there is a valid match
>     val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
>     assert(targetRow != null)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6

Reply via email to