Re: DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
I doubt that this issue is related to the string encoding, because

- "pr^?files.10056.10040" should be "profiles.10056.10040", which is
defined as a constant in the source code
- 
"profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@"
should not occur in the exception at all, because such strings are
never created within the job
- the corrupted strings are defined within the job itself; no such
strings exist in the input data
- when YARN restarts the job after the first failure, the second
attempt completes successfully
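One way to tell in-memory corruption apart from an encoding problem is to dump the code points of the offending key: a UTF-8 mis-decode usually surfaces as U+FFFD replacement characters, whereas the ^@ (NUL, U+0000) and ^? (DEL, U+007F) characters in the errors above suggest bytes being overwritten. A minimal, hypothetical diagnostic sketch (names are mine, not from the job):

```scala
// Hypothetical helper: render control characters in a key as visible
// \uXXXX escapes so corrupted bytes show up clearly in the logs.
object KeyDiagnostics {
  def describe(key: String): String =
    key.map { c =>
      if (c.isControl) f"\\u${c.toInt}%04x" // make NUL, DEL, etc. visible
      else c.toString
    }.mkString

  def main(args: Array[String]): Unit = {
    // "pr^?files.10056.10040" from the log: '^?' is the DEL character U+007F
    println(describe("pr\u007ffiles.10056.10040"))
    // prints: pr\u007ffiles.10056.10040
  }
}
```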



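As a stopgap while hunting this down, the match inside collectMetrics (quoted below) could be given a logging catch-all, so a corrupted key fails with its exact code points rather than an opaque scala.MatchError. A sketch with hypothetical case values standing in for the real ones:

```scala
// Sketch: fail loudly with visible code points instead of a bare MatchError.
// The case strings mirror the quoted snippet; the real updateMetrics calls
// are replaced by simple labels for illustration.
object MetricsKeys {
  private def visible(s: String): String =
    s.map(c => if (c.isControl) f"\\u${c.toInt}%04x" else c.toString).mkString

  def classify(key: String): String = key match {
    case "profiles.total" => "total"
    case "profiles.biz"   => "biz"
    case other =>
      throw new IllegalArgumentException(
        s"unexpected key '${visible(other)}' (length ${other.length})")
  }
}
```

With something like this in place, the executor log shows exactly which characters were damaged, which helps judge whether the failure looks more like off-heap buffer corruption or a genuine input-data problem.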

On Wed, Mar 28, 2018 at 10:31 PM, Jörn Franke  wrote:
> An encoding issue in the data? E.g. Spark uses UTF-8, but the source
> encoding is different?
>
>> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky  wrote:
>>
>> Hello guys,
>>
>> I'm using Spark 2.2.0, and from time to time my job fails, printing
>> the following errors into the log:
>>
>> scala.MatchError:
>> profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>>
>> The job itself looks like the following and contains a few shuffles and UDAFs:
>>
>> val df = spark.read.avro(...).as[...]
>>  .groupBy(...)
>>  .agg(collect_list(...).as(...))
>>  .select(explode(...).as(...))
>>  .groupBy(...)
>>  .agg(sum(...).as(...))
>>  .groupBy(...)
>>  .agg(collectMetrics(...).as(...))
>>
>> The errors occur in the collectMetrics UDAF, in the following snippet:
>>
>> key match {
>>  case "profiles.total" => updateMetrics(...)
>>  case "profiles.biz" => updateMetrics(...)
>>  case ProfileAttrsRegex(...) => updateMetrics(...)
>> }
>>
>> ... and I'm perfectly fine with the scala.MatchError itself, because
>> there is no catch-all case in the pattern-matching expression, but the
>> strings containing corrupted characters look very strange.
>>
>> I've found the following JIRA issues, but it's hard to say
>> whether they are related to my case:
>> - https://issues.apache.org/jira/browse/SPARK-22092
>> - https://issues.apache.org/jira/browse/SPARK-23512
>>
>> So I'm wondering: has anybody ever seen this kind of behaviour, and
>> what could be the problem?
>>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


