Hello, I am using Spark 2.0 to filter, group, and count events data saved in Parquet files. Because the events schema is deeply nested, I wanted to read the events as Scala beans to simplify the code, but I noticed a severe performance degradation. Below is a simple comparison of the same operation using a DataFrame and a Dataset.

val data = session.read.parquet("events_data/")

// Using Datasets with a custom class

// Case class matching the events schema
case class CustomEvent(event_id: Option[String],
                       event_type: Option[String],
                       context: Option[Context],
                       ….
                       time: Option[BigInt]) extends Serializable {}

scala> val start = System.currentTimeMillis ; val count = data.as[CustomEvent].filter(event => eventNames.contains(event.event_type.get)).count ; val time = System.currentTimeMillis - start
count: Long = 5545
time: Long = 11262

// Using DataFrames

scala> val start = System.currentTimeMillis ; val count = data.filter(col("event_type").isin(eventNames:_*)).count ; val time = System.currentTimeMillis - start
count: Long = 5545
time: Long = 147

The schema of the events is something like this:

// events schema
root
 |-- event_id: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- context: struct (nullable = true)
 |    |-- environment_1: struct (nullable = true)
 |    |    |-- filed1: integer (nullable = true)
 |    |    |-- filed2: integer (nullable = true)
 |    |    |-- filed3: integer (nullable = true)
 |    |-- environment_2: struct (nullable = true)
 |    |    |-- filed_1: string (nullable = true)
 ....
 |    |    |-- filed_n: string (nullable = true)
 |-- metadata: struct (nullable = true)
 |    |-- elements: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- config: string (nullable = true)
 |    |    |    |-- tree: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- path: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- key: string (nullable = true)
 |    |    |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |    |    |-- level: integer (nullable = true)
 |-- time: long (nullable = true)

Could you please advise me on when to use each of these abstractions, and help me understand why using Datasets with a user-defined class is so much slower?

Thank you,
Antoaneta