Hello guys,

I'm using Spark 2.2.0, and from time to time my job fails, printing
errors like the following into the log:

scala.MatchError:
profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)

The job itself looks like the following and contains a few shuffles and UDAFs:

val df = spark.read.avro(...).as[...]
      .groupBy(...)
      .agg(collect_list(...).as(...))
      .select(explode(...).as(...))
      .groupBy(...)
      .agg(sum(...).as(...))
      .groupBy(...)
      .agg(collectMetrics(...).as(...))

The errors occur in the collectMetrics UDAF, in the following snippet:

key match {
  case "profiles.total" => updateMetrics(...)
  case "profiles.biz" => updateMetrics(...)
  case ProfileAttrsRegex(...) => updateMetrics(...)
}

The scala.MatchError itself is expected, since there is no "catch all"
case in the pattern match — but the keys containing corrupted characters
seem very strange.
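To make the corruption easier to inspect, one option is to add a catch-all case that dumps the raw bytes of the unexpected key before failing. This is only a sketch, not the actual UDAF code: `classify`, the return values, and the shape of `ProfileAttrsRegex` are stand-ins I made up for illustration.

```scala
// Hypothetical sketch: a catch-all case that logs the raw UTF-8 bytes
// of an unexpected key, so NUL bytes (00) and other garbage stand out.
val ProfileAttrsRegex = """profiles\.(\d+)\.(\d+)""".r  // assumed key shape

def classify(key: String): String = key match {
  case "profiles.total"        => "total"
  case "profiles.biz"          => "biz"
  case ProfileAttrsRegex(a, b) => s"attr:$a.$b"
  case other =>
    // Render each byte in hex before rethrowing, to see the corruption.
    val hex = other.getBytes("UTF-8").map(b => f"$b%02x").mkString(" ")
    throw new IllegalArgumentException(s"Unexpected key '$other' (bytes: $hex)")
}
```

This at least turns the opaque MatchError into a message that shows exactly which bytes are wrong.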

I've found the following JIRA issues, but it's hard to say whether they
are related to my case:
- https://issues.apache.org/jira/browse/SPARK-22092
- https://issues.apache.org/jira/browse/SPARK-23512

So I'm wondering: has anybody seen this kind of behaviour before, and
what could the cause be?
