Never mind my last email. res2 is filtered, so my test does not make sense. The 
issue is not reproduced there; I have the problem somewhere else.
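
For anyone repeating this check, a fairer comparison applies the same filter 
before collecting the distinct values, so that rows dropped by the filter are 
not mistaken for rows lost by groupBy. Below is a minimal sketch, assuming it 
is run inside spark-shell (where the implicits for toDF and $"..." are already 
in scope). The DataFrame `toy` and its three rows are hypothetical stand-ins 
for the real data, and only the "vine", "closed" and "ever_closed" column names 
come from the snippet quoted below.

```scala
// Hypothetical toy data standing in for `data`; assumes spark-shell,
// where the implicits needed for toDF and $"..." are already imported.
val toy = Seq(("v1", 1L, 1L), ("v2", 1L, 0L), ("v3", 0L, 0L))
  .toDF("vine", "closed", "ever_closed")

// Same filter as in the quoted snippet, then a simple aggregation.
val filtered = toy.filter($"closed" === $"ever_closed")
val grouped = filtered.groupBy("vine").count()

// Distinct "vine" values at each stage, collected to the driver as Sets.
val unfilteredVines = toy.select("vine").distinct.collect.map(_.getString(0)).toSet
val filteredVines = filtered.select("vine").distinct.collect.map(_.getString(0)).toSet
val groupedVines = grouped.select("vine").distinct.collect.map(_.getString(0)).toSet

// Values removed by the filter alone, before any grouping happens.
val droppedByFilter = unfilteredVines.diff(filteredVines)
// Values present after the filter but absent after groupBy.
val lostByGroupBy = filteredVines.diff(groupedVines)
```

If lostByGroupBy is empty while droppedByFilter is not, the difference in 
counts comes from the filter and not from the aggregation.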

From: Ellafi, Saif A.
Sent: Thursday, October 22, 2015 12:57 PM
To: 'Xiao Li'
Cc: user
Subject: RE: Spark groupby and agg inconsistent and missing data

Thanks. Sorry, I cannot share the data, and I am not sure how significant it 
would be for you.
I am reproducing the issue on a smaller piece of the content to see whether I 
can find a reason for the inconsistency.

val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band",
  "aget", "vine", "time", "yyyymm").agg(count($"account_id").as("N"),
  sum($"balance").as("balance"), sum($"spend").as("spend"),
  sum($"payment").as("payment")).persist()

Then I collect the distinct values of “vine” (which is StringType) from both 
data and res2, and res2 is missing a lot of values:

val t1 = res2.select("vine").distinct.collect
scala> t1.size
res10: Int = 617

val t_real = data.select("vine").distinct.collect
scala> t_real.size
res9: Int = 639


From: Xiao Li [mailto:gatorsm...@gmail.com]
Sent: Thursday, October 22, 2015 12:45 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: Spark groupby and agg inconsistent and missing data

Hi, Saif,

Could you post your code here? It might help others reproduce the errors and 
give you a correct answer.

Thanks,

Xiao Li

2015-10-22 8:27 GMT-07:00 <saif.a.ell...@wellsfargo.com>:
Hello everyone,

I am doing some analytics experiments on a 4-server standalone cluster in a 
Spark shell, mostly involving a huge database with groupBy and aggregations.

I am picking 6 groupBy columns and returning various aggregated results in a 
DataFrame. The groupBy fields are of two types: most of them are StringType and 
the rest are LongType.

The data source is a DataFrame loaded from a split JSON file. Once the data is 
persisted, the result is consistent. But if I unload it from memory and reload 
the data, the groupBy action returns different results, with data missing.

Could I be missing something? This is rather serious for my analytics, and I am 
not sure how to properly diagnose this situation.

Thanks,
Saif

