Hello,

I am using Spark 2.0 for performing filtering, grouping and counting
operations on events data saved in parquet files. As the events schema has
very nested structure I wanted to read them as scala beans to simplify the
code but I noticed a severe performance degradation. Below you can find
simple comparison of the same operation using DataFrame and Dataset.

val data = session.read.parquet("events_data/")

// Using Datasets with custom class

//Case class matching the events schema

case class CustomEvent(event_id: Option[String],

event_type: Option[String]
           context : Option[Context],

….
                       time: Option[BigInt]) extends Serializable {}

scala> val start = System.currentTimeMillis ;

          val count = data.as[CustomEvent].filter(event =>
eventNames.contains(event.event_type.get)).count ;

         val time =  System.currentTimeMillis - start

count: Long = 5545

time: Long = 11262

// Using DataFrames

scala>

val start = System.currentTimeMillis ;

val count = data.filter(col("event_type").isin(eventNames:_*)).count ;

val time =  System.currentTimeMillis - start

count: Long = 5545

time: Long = 147

The schema of the events is something like this:

//events schma

schemaroot

|-- event_id: string (nullable = true)

|-- event_type: string (nullable = true)

|-- context: struct (nullable = true)

|    |-- environment_1: struct (nullable = true)

|    |    |-- filed1: integer (nullable = true)

|    |    |-- filed2: integer (nullable = true)

|    |    |-- filed3: integer (nullable = true)

|    |-- environment_2: struct (nullable = true)

|    |    |-- filed_1: string (nullable = true)

....

|    |    |-- filed_n: string (nullable = true)

|-- metadata: struct (nullable = true)

|    |-- elements: array (nullable = true)

|    |    |-- element: struct (containsNull = true)

|    |    |    |-- config: string (nullable = true)

|    |    |    |-- tree: array (nullable = true)

|    |    |    |    |-- element: struct (containsNull = true)

|    |    |    |    |    |-- path: array (nullable = true)

|    |    |    |    |    |    |-- element: struct (containsNull = true)

|    |    |    |    |    |    |    |-- key: string (nullable = true)

|    |    |    |    |    |    |    |-- name: string (nullable = true)

|    |    |    |    |    |    |    |-- level: integer (nullable = true)

|-- time: long (nullable = true)

Could you please advise me on the usage of the different abstractions and
help me understand why using datasets with user defined class is so much
slower.

Thank you,
Antoaneta

Reply via email to