Encoding issue in the data? E.g., Spark uses UTF-8, but the source encoding is different?
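One quick way to check, as a hedged sketch (the helper names `containsCorruptChars` and `handleKey` are made up for illustration, not from your job): guard the match against the control characters your log shows — `^@` is NUL (`\u0000`) and `^?` is DEL (`\u007F`) — and dump the code units of any offending key instead of letting scala.MatchError kill the task:

```scala
// NUL (\u0000) renders as ^@ and DEL (\u007F) as ^? in the errors quoted below.
def containsCorruptChars(s: String): Boolean =
  s.exists(c => c == '\u0000' || c == '\u007F')

// Hypothetical defensive handler for the UDAF's key match: instead of
// failing, surface corrupted keys with their raw UTF-16 code units in hex.
def handleKey(key: String): String = key match {
  case k if containsCorruptChars(k) =>
    s"corrupt: ${k.map(c => f"${c.toInt}%04x").mkString(" ")}"
  case "profiles.total" => "total"
  case other            => s"unmatched: $other"
}
```

If the hex dump shows NUL padding like in your log, that would point at bytes being misread (wrong source encoding, or the buffer-reuse issues in the JIRAs you linked) rather than at the match expression itself.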
> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>
> Hello guys,
>
> I'm using Spark 2.2.0, and from time to time my job fails, printing the
> following errors into the log:
>
> scala.MatchError: profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>
> The job itself looks like the following and contains a few shuffles and UDAFs:
>
> val df = spark.read.avro(...).as[...]
>   .groupBy(...)
>   .agg(collect_list(...).as(...))
>   .select(explode(...).as(...))
>   .groupBy(...)
>   .agg(sum(...).as(...))
>   .groupBy(...)
>   .agg(collectMetrics(...).as(...))
>
> The errors occur in the collectMetrics UDAF, in the following snippet:
>
> key match {
>   case "profiles.total" => updateMetrics(...)
>   case "profiles.biz" => updateMetrics(...)
>   case ProfileAttrsRegex(...) => updateMetrics(...)
> }
>
> ... and I'm absolutely OK with the scala.MatchError itself, because there
> is no "catch all" case in the pattern-matching expression, but the strings
> containing corrupted characters seem very strange.
>
> I've found the following JIRA issues, but it's hard to say whether they
> are related to my case:
> - https://issues.apache.org/jira/browse/SPARK-22092
> - https://issues.apache.org/jira/browse/SPARK-23512
>
> So I'm wondering: has anybody ever seen this kind of behaviour, and what
> could be the problem?
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> ---------------------------------------------------------------------