I doubt that this issue is connected with string encoding, because:
- "pr^?files.10056.10040" should be "profiles.10056.10040" and is defined as constant in the source code - "profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@" should no occur in exception at all, because such a strings are not created within the job - the strings being corrupted are defined within the job and there are no such input data - when yarn restarts the job for the second time after the first failure, the job completes successfully On Wed, Mar 28, 2018 at 10:31 PM, Jörn Franke <jornfra...@gmail.com> wrote: > Encoding issue of the data? Eg spark uses utf-8 , but source encoding is > different? > >> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote: >> >> Hello guys, >> >> I'm using Spark 2.2.0 and from time to time my job fails printing into >> the log the following errors >> >> scala.MatchError: >> profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@ >> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String) >> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String) >> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String) >> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String) >> >> The job itself looks like the following and contains a few shuffles and UDAFs >> >> val df = spark.read.avro(...).as[...] >> .groupBy(...) >> .agg(collect_list(...).as(...)) >> .select(explode(...).as(...)) >> .groupBy(...) >> .agg(sum(...).as(...)) >> .groupBy(...) >> .agg(collectMetrics(...).as(...)) >> >> The errors occur in the collectMetrics UDAF in the following snippet >> >> key match { >> case "profiles.total" => updateMetrics(...) >> case "profiles.biz" => updateMetrics(...) >> case ProfileAttrsRegex(...) => updateMetrics(...) >> } >> >> ... 
>> and I'm absolutely fine with the scala.MatchError itself, because there is no
>> "catch all" case in the pattern-matching expression, but the strings
>> containing corrupted characters seem very strange.
>>
>> I've found the following JIRA issues, but it's hard to say
>> whether they are related to my case:
>> - https://issues.apache.org/jira/browse/SPARK-22092
>> - https://issues.apache.org/jira/browse/SPARK-23512
>>
>> So I'm wondering, has anybody ever seen this kind of behaviour, and
>> what could the problem be?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> ---------------------------------------------------------------------
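[Editor's note] One way to make failures like the above diagnosable is to add a catch-all case that dumps the raw bytes of any unexpected key, so the executor log shows exactly which characters were corrupted instead of an opaque scala.MatchError. The sketch below is not the original job's code: the regex, the return values, and the `hexDump` helper are hypothetical stand-ins mirroring the snippet quoted in the thread.

```scala
// Minimal sketch: match the metric keys from the thread, but instead of
// letting scala.MatchError escape on a corrupted key, report its UTF-8
// bytes in hex so the corruption is visible in the logs.
object KeyMatchSketch {
  // Hypothetical stand-in for the job's ProfileAttrsRegex.
  val ProfileAttrsRegex = """profiles\.(\d+)\.(\d+)""".r

  // Render a string's UTF-8 bytes as space-separated hex pairs.
  def hexDump(s: String): String =
    s.getBytes("UTF-8").map(b => f"$b%02x").mkString(" ")

  def classify(key: String): String = key match {
    case "profiles.total"        => "total"
    case "profiles.biz"          => "biz"
    case ProfileAttrsRegex(_, _) => "attr"
    case other =>
      // Catch-all: surface the corrupted bytes rather than crashing.
      s"unexpected key (utf-8 bytes: ${hexDump(other)})"
  }
}
```

With this in place, a corrupted key such as "pr^?files.10056.10040" (where ^? is the DEL byte 0x7f) produces a log line with its exact byte sequence, which helps distinguish in-memory corruption from an upstream serialization problem.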