Encoding issue in the data? E.g., Spark uses UTF-8, but the source encoding is different?
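One quick way to check, as a hedged sketch (the helper names `containsCorruptChars` and `handleKey` are made up for illustration, not from your job): guard the match against the control characters your log shows — `^@` is NUL (`\u0000`) and `^?` is DEL (`\u007F`) — and dump the code units of any offending key instead of letting scala.MatchError kill the task:

```scala
// NUL (\u0000) renders as ^@ and DEL (\u007F) as ^? in the errors quoted below.
def containsCorruptChars(s: String): Boolean =
  s.exists(c => c == '\u0000' || c == '\u007F')

// Hypothetical defensive handler for the UDAF's key match: instead of
// failing, surface corrupted keys with their raw UTF-16 code units in hex.
def handleKey(key: String): String = key match {
  case k if containsCorruptChars(k) =>
    s"corrupt: ${k.map(c => f"${c.toInt}%04x").mkString(" ")}"
  case "profiles.total" => "total"
  case other            => s"unmatched: $other"
}
```

If the hex dump shows NUL padding like in your log, that would point at bytes being misread (wrong source encoding, or the buffer-reuse issues in the JIRAs you linked) rather than at the match expression itself.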
> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>
> Hello guys,
>
> I'm using Spark 2.2.0, and from time to time my job fails, printing the
> following errors into the log:
>
> scala.MatchError: profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>
> The job itself looks like the following and contains a few shuffles and UDAFs:
>
> val df = spark.read.avro(...).as[...]
>   .groupBy(...)
>   .agg(collect_list(...).as(...))
>   .select(explode(...).as(...))
>   .groupBy(...)
>   .agg(sum(...).as(...))
>   .groupBy(...)
>   .agg(collectMetrics(...).as(...))
>
> The errors occur in the collectMetrics UDAF, in the following snippet:
>
> key match {
>   case "profiles.total" => updateMetrics(...)
>   case "profiles.biz" => updateMetrics(...)
>   case ProfileAttrsRegex(...) => updateMetrics(...)
> }
>
> ... and I'm absolutely OK with the scala.MatchError itself, because there
> is no "catch all" case in the pattern-matching expression, but the strings
> containing corrupted characters seem very strange.
>
> I've found the following JIRA issues, but it's hard to say whether they
> are related to my case:
> - https://issues.apache.org/jira/browse/SPARK-22092
> - https://issues.apache.org/jira/browse/SPARK-23512
>
> So I'm wondering: has anybody ever seen this kind of behaviour, and what
> could be the problem?
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> ---------------------------------------------------------------------